Introduction to Open Data Science - Course Project

About the project

Welcome to my page

This is the learning diary for the course ‘Introduction to Open Data Science 2020’. I would like to further develop my data science skills in R and get familiar with GitHub.
My GitHub repository is https://github.com/Imangholiloo/IODS-project

# Let's get the date and time
date()
## [1] "Wed Nov 25 19:29:42 2020"

Alternatively, use print(Sys.time()).

# Let's say hello via R!
for (i in "Hello anyone") {
  print(i)
}
## [1] "Hello anyone"
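Note that the loop above runs only once, because "Hello anyone" is a character vector of length 1. To greet word by word, one could split the string first (a small illustrative sketch):

```r
# strsplit() returns a list; [[1]] extracts the vector of words
words <- strsplit("Hello anyone", " ")[[1]]
for (w in words) {
  print(w)  # prints "Hello", then "anyone"
}
```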
# load package
library(rmarkdown)


————————————————————————————

Mohammad Imangholiloo
20201105

Exercise 2: Linear regression analysis

Part 1) Data wrangling

Read the data to R

lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)

Alternatively, read it from your local drive:
lrn14 <- read.table("C:/Users/Mohammad/Documents/IODS-project/data/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
or read it from your clipboard using lrn14 <- readClipboard(). The metadata (description of the data) is available here (in Finnish): https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS2-meta.txt

FYI: to keep character columns as characters (rather than factors), add stringsAsFactors=FALSE to the read.table() call. (Since R 4.0 this is already the default.)
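A minimal self-contained illustration of the stringsAsFactors behaviour, reading from an in-memory string instead of the course file:

```r
# two rows of tab-separated data, standing in for a file on disk
txt <- "gender\tAge\nF\t53\nM\t55"
d <- read.table(text = txt, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
is.character(d$gender)  # TRUE: kept as character, not converted to a factor
```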

# check dimensions of the data
dim(lrn14)
## [1] 183  60
# check structure of the data
str(lrn14)
## 'data.frame':    183 obs. of  60 variables:
##  $ Aa      : int  3 2 4 4 3 4 4 3 2 3 ...
##  $ Ab      : int  1 2 1 2 2 2 1 1 1 2 ...
##  $ Ac      : int  2 2 1 3 2 1 2 2 2 1 ...
##  $ Ad      : int  1 2 1 2 1 1 2 1 1 1 ...
##  $ Ae      : int  1 1 1 1 2 1 1 1 1 1 ...
##  $ Af      : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ ST01    : int  4 4 3 3 4 4 5 4 4 4 ...
##  $ SU02    : int  2 2 1 3 2 3 2 2 1 2 ...
##  $ D03     : int  4 4 4 4 5 5 4 4 5 4 ...
##  $ ST04    : int  4 4 4 4 3 4 2 5 5 4 ...
##  $ SU05    : int  2 4 2 3 4 3 2 4 2 4 ...
##  $ D06     : int  4 2 3 4 4 5 3 3 4 4 ...
##  $ D07     : int  4 3 4 4 4 5 4 4 5 4 ...
##  $ SU08    : int  3 4 1 2 3 4 4 2 4 2 ...
##  $ ST09    : int  3 4 3 3 4 4 2 4 4 4 ...
##  $ SU10    : int  2 1 1 1 2 1 1 2 1 2 ...
##  $ D11     : int  3 4 4 3 4 5 5 3 4 4 ...
##  $ ST12    : int  3 1 4 3 2 3 2 4 4 4 ...
##  $ SU13    : int  3 3 2 2 3 1 1 2 1 2 ...
##  $ D14     : int  4 2 4 4 4 5 5 4 4 4 ...
##  $ D15     : int  3 3 2 3 3 4 2 2 3 4 ...
##  $ SU16    : int  2 4 3 2 3 2 3 3 4 4 ...
##  $ ST17    : int  3 4 3 3 4 3 4 3 4 4 ...
##  $ SU18    : int  2 2 1 1 1 2 1 2 1 2 ...
##  $ D19     : int  4 3 4 3 4 4 4 4 5 4 ...
##  $ ST20    : int  2 1 3 3 3 3 1 4 4 2 ...
##  $ SU21    : int  3 2 2 3 2 4 1 3 2 4 ...
##  $ D22     : int  3 2 4 3 3 5 4 2 4 4 ...
##  $ D23     : int  2 3 3 3 3 4 3 2 4 4 ...
##  $ SU24    : int  2 4 3 2 4 2 2 4 2 4 ...
##  $ ST25    : int  4 2 4 3 4 4 1 4 4 4 ...
##  $ SU26    : int  4 4 4 2 3 2 1 4 4 4 ...
##  $ D27     : int  4 2 3 3 3 5 4 4 5 4 ...
##  $ ST28    : int  4 2 5 3 5 4 1 4 5 2 ...
##  $ SU29    : int  3 3 2 3 3 2 1 2 1 2 ...
##  $ D30     : int  4 3 4 4 3 5 4 3 4 4 ...
##  $ D31     : int  4 4 3 4 4 5 4 4 5 4 ...
##  $ SU32    : int  3 5 5 3 4 3 4 4 3 4 ...
##  $ Ca      : int  2 4 3 3 2 3 4 2 3 2 ...
##  $ Cb      : int  4 4 5 4 4 5 5 4 5 4 ...
##  $ Cc      : int  3 4 4 4 4 4 4 4 4 4 ...
##  $ Cd      : int  4 5 4 4 3 4 4 5 5 5 ...
##  $ Ce      : int  3 5 3 3 3 3 4 3 3 4 ...
##  $ Cf      : int  2 3 4 4 3 4 5 3 3 4 ...
##  $ Cg      : int  3 2 4 4 4 5 5 3 5 4 ...
##  $ Ch      : int  4 4 2 3 4 4 3 3 5 4 ...
##  $ Da      : int  3 4 1 2 3 3 2 2 4 1 ...
##  $ Db      : int  4 3 4 4 4 5 4 4 2 4 ...
##  $ Dc      : int  4 3 4 5 4 4 4 4 4 4 ...
##  $ Dd      : int  5 4 1 2 4 4 5 3 5 2 ...
##  $ De      : int  4 3 4 5 4 4 5 4 4 2 ...
##  $ Df      : int  2 2 1 1 2 3 1 1 4 1 ...
##  $ Dg      : int  4 3 3 5 5 4 4 4 5 1 ...
##  $ Dh      : int  3 3 1 4 5 3 4 1 4 1 ...
##  $ Di      : int  4 2 1 2 3 3 2 1 4 1 ...
##  $ Dj      : int  4 4 5 5 3 5 4 5 2 4 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : chr  "F" "M" "F" "M" ...

As the output shows, the data have 183 observations and 60 variables, and the data structure is also given (all variables are integer, except gender, which is character).

#3— selecting columns
Define groups of column names (as character vectors) for the questions under the umbrella of ‘deep’, ‘surface’ and ‘strategic’ learning

#install.packages('dplyr')
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30", "D06",  "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")

# select the columns related to deep learning and create column 'deep' by averaging
deep_columns <- select(lrn14, one_of(deep_questions))
lrn14$deep <- rowMeans(deep_columns)

# select the columns related to surface_questions and create column 'surface' by averaging
surface_columns <- select(lrn14, one_of(surface_questions))
lrn14$surface <- rowMeans(surface_columns)

# select the columns related to strategic_questions and create column 'strategic' by averaging
strategic_columns <- select(lrn14, one_of(strategic_questions))
lrn14$strategic <- rowMeans(strategic_columns)
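The three rowMeans() steps above follow the same pattern, so they could also be written as one loop over named question groups. A self-contained sketch with a tiny made-up data frame standing in for lrn14:

```r
# toy stand-in for lrn14, with two questions per scale
toy <- data.frame(D03 = c(4, 2), D11 = c(3, 4),
                  SU02 = c(2, 2), SU10 = c(2, 1))
question_groups <- list(deep = c("D03", "D11"),
                        surface = c("SU02", "SU10"))
# one averaged column per named group
for (nm in names(question_groups)) {
  toy[[nm]] <- rowMeans(toy[, question_groups[[nm]]])
}
toy$deep  # 3.5 3.0
```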


# select only the specific columns needed, including the ones just created
keep_columns <- c("gender","Age","Attitude", "deep", "strategic", "surface", "Points")
needed_columns_2014 <- select(lrn14, one_of(keep_columns))
# check structure of the new df
str(needed_columns_2014)
## 'data.frame':    183 obs. of  7 variables:
##  $ gender   : chr  "F" "M" "F" "M" ...
##  $ Age      : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude : int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep     : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ strategic: num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surface  : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points   : int  25 12 24 10 22 21 21 31 24 26 ...
# Modify column names: let's make them all lowercase (not capitalized)
colnames(needed_columns_2014)[2] <- "age"
colnames(needed_columns_2014)[3] <- "attitude"
colnames(needed_columns_2014)[7] <- "points"

#check that the renaming worked
str(needed_columns_2014)
## 'data.frame':    183 obs. of  7 variables:
##  $ gender   : chr  "F" "M" "F" "M" ...
##  $ age      : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude : int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep     : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ strategic: num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surface  : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points   : int  25 12 24 10 22 21 21 31 24 26 ...
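Since the goal is simply lowercase names, an alternative is tolower() over all the column names at once. A sketch on a toy data frame (not the course data):

```r
toy <- data.frame(gender = "F", Age = 53, Attitude = 37, Points = 25)
# lowercase every column name in one step
colnames(toy) <- tolower(colnames(toy))
colnames(toy)  # "gender" "age" "attitude" "points"
```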
#exclude observations with zero exam points (i.e. keep rows with points > 0)
needed_columns_2014 <- filter(needed_columns_2014, points > 0) #alternatively: points != 0

#let's check a descriptive statistical summary of the columns, FYI
summary(needed_columns_2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :32.00   Median :3.667  
##                     Mean   :25.51   Mean   :31.43   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :50.00   Max.   :4.917  
##    strategic        surface          points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
View(needed_columns_2014) #to view the table in R

#4—

#set working dir
setwd("C:/Users/Mohammad/Documents/IODS-project/data/")

#Save the analysis dataset to the ‘data’ folder
write.csv(needed_columns_2014, "learning2014.csv") 
#to be sure it was saved correctly, read the data again!
learning2014_read_again <- read.csv("C:/Users/Mohammad/Documents/IODS-project/data/learning2014.csv")
head(learning2014_read_again)
##   X gender age attitude     deep strategic  surface points
## 1 1      F  53       37 3.583333     3.375 2.583333     25
## 2 2      M  55       31 2.916667     2.750 3.166667     12
## 3 3      F  49       25 3.500000     3.625 2.250000     24
## 4 4      M  53       35 3.500000     3.125 2.250000     10
## 5 5      M  49       37 3.666667     3.625 2.833333     22
## 6 6      F  38       38 4.750000     3.625 2.416667     21
tail(learning2014_read_again)
##       X gender age attitude     deep strategic  surface points
## 161 161      F  19       20 4.083333     3.375 2.833333     20
## 162 162      F  22       42 2.916667     1.750 3.166667     28
## 163 163      M  35       41 3.833333     3.000 2.750000     31
## 164 164      F  18       37 3.166667     2.625 3.416667     18
## 165 165      F  19       36 3.416667     2.625 3.000000     30
## 166 166      M  21       18 4.083333     3.375 2.666667     19
str(learning2014_read_again)
## 'data.frame':    166 obs. of  8 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender   : chr  "F" "M" "F" "M" ...
##  $ age      : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude : int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep     : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ strategic: num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surface  : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points   : int  25 12 24 10 22 21 21 31 24 26 ...
summary(learning2014_read_again)
##        X             gender               age           attitude    
##  Min.   :  1.00   Length:166         Min.   :17.00   Min.   :14.00  
##  1st Qu.: 42.25   Class :character   1st Qu.:21.00   1st Qu.:26.00  
##  Median : 83.50   Mode  :character   Median :22.00   Median :32.00  
##  Mean   : 83.50                      Mean   :25.51   Mean   :31.43  
##  3rd Qu.:124.75                      3rd Qu.:27.00   3rd Qu.:37.00  
##  Max.   :166.00                      Max.   :55.00   Max.   :50.00  
##       deep         strategic        surface          points     
##  Min.   :1.583   Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:3.333   1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.667   Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.680   Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:4.083   3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.917   Max.   :5.000   Max.   :4.333   Max.   :33.00
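The extra X column in the re-read data comes from write.csv() saving row names by default. A self-contained sketch (using a temporary file rather than the course data) showing that row.names = FALSE avoids it:

```r
df <- data.frame(gender = c("F", "M"), points = c(25, 12))
f <- tempfile(fileext = ".csv")
write.csv(df, f, row.names = FALSE)  # no row-name column is written
back <- read.csv(f)
colnames(back)  # just "gender" and "points", no extra "X"
```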

Part 2) Data analysis

read the data

learning2014 <- read.csv("C:/Users/Mohammad/Documents/IODS-project/data/learning2014_MI.csv")

#Metadata can be found in https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt

#alternatively by:
#learning2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header=TRUE, sep = ",")

**Description of the data:**
This dataset was gathered by Kimmo Vehkalahti in the course Introduction to Social Statistics, fall 2014, supported by his Teachers' Academy funding (2013-2015).

str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
View(learning2014)
dim(learning2014) #this dataset contains 166 rows and 8 columns
## [1] 166   8
colnames(learning2014) #column names are printed in the console
## [1] "X"        "gender"   "age"      "attitude" "deep"     "stra"     "surf"    
## [8] "points"
summary(learning2014) #a short view of descriptive statistical features of the columns, e.g. mean, min, max
##        X             gender               age           attitude    
##  Min.   :  1.00   Length:166         Min.   :17.00   Min.   :1.400  
##  1st Qu.: 42.25   Class :character   1st Qu.:21.00   1st Qu.:2.600  
##  Median : 83.50   Mode  :character   Median :22.00   Median :3.200  
##  Mean   : 83.50                      Mean   :25.51   Mean   :3.143  
##  3rd Qu.:124.75                      3rd Qu.:27.00   3rd Qu.:3.700  
##  Max.   :166.00                      Max.   :55.00   Max.   :5.000  
##       deep            stra            surf           points     
##  Min.   :1.583   Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:3.333   1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.667   Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.680   Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:4.083   3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.917   Max.   :5.000   Max.   :4.333   Max.   :33.00
table(learning2014$gender) # to show the number of participants by gender
## 
##   F   M 
## 110  56
table(learning2014$age) # to show the number of participants by age
## 
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39 40 42 44 45 
##  1  3  7 25 31 20 12  7 10  6  3  2  4  5  3  1  3  3  2  4  2  1  1  1  1  1 
## 48 49 50 53 55 
##  1  2  1  2  1

#2—

#graphically show the interactions (i.e. correlations and distributions) of the columns (variables) of this dataset
plot(learning2014[c(-1,-2)]) # c(-1,-2) drops the index column and gender (strings) from the df; source: https://statisticsglobe.com/remove-element-from-list-in-r

hist(learning2014$age, col='grey') # to check histogram of a column (e.g. age)

Extra coding 1: Loop to automatically plot many variables

A loop to automatically plot a histogram of every variable. NB: check that the variables are all numerical, otherwise hist() will stop on gender (male/female).

colnames(learning2014[-2]) # [-2] drops the second column (gender); source: https://statisticsglobe.com/remove-element-from-list-in-r
## [1] "X"        "age"      "attitude" "deep"     "stra"     "surf"     "points"
for (i in colnames(learning2014[-2])){
  print(i)
  hist(learning2014[[i]], col='grey')
}
## [1] "X"

## [1] "age"

## [1] "attitude"

## [1] "deep"

## [1] "stra"

## [1] "surf"

## [1] "points"

**Visualize with ggplot**

#install and load it
#install.packages("ggplot2")
library(ggplot2)

# setup plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender))

# select visualization type (points)
p2 <- p1 + geom_point()

# draw the plot
p2

# add a regression line
p3 <- p2 + geom_smooth(method = "lm")

# add a main title and draw the plot
p4 <- p3 + ggtitle("Student's attitude versus exam points")
p4
## `geom_smooth()` using formula 'y ~ x'

Interpretation:
As the graph shows, there is a strong positive correlation between students' attitude and their exam points. The correlation is stronger for male students, as their slope is steeper than the females'; so male students' attitude influences the exam points more than female students' attitude does.

Still interested in more visualization?

If so, read this section :)
Draw a scatter plot matrix

#pairs(learning2014[c(-1:-2)]) # c(-1:-2) drops the index and gender columns
pairs(learning2014[c(-1:-2)], col = c("black", "red"))

F: black, M: red
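A caveat: passing col = c("black", "red") to pairs() simply recycles the two colors over the rows in order, so the colors do not actually track gender. To map colors to gender, index the color vector by the factor codes; a self-contained sketch with made-up data:

```r
set.seed(1)
toy <- data.frame(gender = rep(c("F", "M"), each = 10),
                  attitude = runif(20, 1, 5),
                  points = runif(20, 7, 33))
# factor codes (F = 1, M = 2) pick the matching color per row
cols <- c("black", "red")[as.integer(factor(toy$gender))]
pairs(toy[, c("attitude", "points")], col = cols)
```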

Do you wish to make even fancier, more advanced plots?

If yes, use the GGally and ggplot2 libraries to create an even more advanced plot matrix with ggpairs()

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
p <- ggpairs(learning2014[c(-1:-2)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p

I hope you got fascinated and liked it as much as I did :)
The graph shows a (fairly strong) negative correlation (r = -0.324) between surface and deep learning

#3—
Now that we are familiar with the variables and their distributions and correlations, it’s time to fit a regression model

—- Simple linear regression model —–

Let’s create a scatter plot of points versus attitude variables

qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# let's make a simple linear regression model between exam points and attitude variables
my_model <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Interpretation: This model shows that attitude has a statistically significant relationship with the target variable and can be used to predict the exam points: the Pr(>|t|) values are well below 0.05, so the model is significant at the 95% significance level, and in fact even above 99%, since the values are below 0.01. The fitted model is points = 3.5255*attitude + 11.6372; a linear model has the form y = ax + b, where a is the coefficient of the variable x and b is the intercept, both of which can be read from the model summary.
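The coefficients and predictions can also be pulled out programmatically with coef() and predict(). A self-contained sketch on toy numbers (not the course estimates):

```r
# tiny made-up data; in the exercise this would be lm(points ~ attitude, data = learning2014)
toy <- data.frame(attitude = 1:5, points = c(15, 18, 23, 26, 29))
m <- lm(points ~ attitude, data = toy)
coef(m)                                          # intercept and slope
predict(m, newdata = data.frame(attitude = 3))   # point prediction at attitude = 3
```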

—- Multilinear regression model (with multiple variables) —-

Let’s select the variables of interest and plot them:

ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

In this model we tried to predict the exam points using the attitude, stra and surf variables, but as the results show, only attitude has a statistically significant relationship with the exam points; thus, we shall change the other variables.

#4—
It seems that there is no significant relationship between exam points and strategic or surface learning, so we shall remove them.

Let’s try another model
my_model3 <- lm(points ~ attitude + stra + deep + age, data = learning2014)
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude + stra + deep + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9941  -3.0839   0.5037   3.5519  11.3298 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.24374    3.57079   3.709 0.000286 ***
## attitude     3.53884    0.56538   6.259 3.37e-09 ***
## stra         1.05032    0.53652   1.958 0.051998 .  
## deep        -0.73226    0.74677  -0.981 0.328273    
## age         -0.08750    0.05303  -1.650 0.100866    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.261 on 161 degrees of freedom
## Multiple R-squared:  0.2228, Adjusted R-squared:  0.2035 
## F-statistic: 11.54 on 4 and 161 DF,  p-value: 2.915e-08

As the model summary shows, the other variables (age, deep, stra) are still not significant, but stra is very close to the 95% significance level, as its Pr(>|t|) value is only slightly above 0.05.

#5—

Yet another multiple linear regression model
my_model4 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model4)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Interpretation
Same conclusion as above: stra is not statistically significant (at the 95% significance level), so we could either remove it from the equation (preferred) or keep it, for the purpose of this practical, to have a multi-variable regression.
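Nested models like my_model (attitude only) and my_model4 (attitude + stra) can also be compared formally with an F-test via anova(), or with AIC(). A self-contained sketch on simulated data, where stra deliberately has no true effect:

```r
set.seed(42)
toy <- data.frame(attitude = runif(50, 1, 5), stra = runif(50, 1, 5))
toy$points <- 10 + 3 * toy$attitude + rnorm(50, sd = 3)  # only attitude matters
m1 <- lm(points ~ attitude, data = toy)
m2 <- lm(points ~ attitude + stra, data = toy)
anova(m1, m2)  # F-test: does adding stra significantly improve the fit?
AIC(m1, m2)    # lower AIC indicates a better fit/complexity trade-off
```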

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2)) #defines that plots go to 2*2 frame (joining plots together)
plot(my_model4, which = c(1,2,5))

Interpretation
In linear regression, we assume:
1. a linear relationship between the explanatory variables and the target variable
2. the errors are normally distributed
3. the errors are not correlated with each other
4. the errors have constant variance
5. the size of a given error does not depend on the explanatory variables.

We can check these assumptions as follows:
To explore the normality assumption, the QQ plot helps us. As it shows, there is a reasonable fit between the dots and the line; thus, the errors are approximately normally distributed.
To analyse the assumption of constant error variance (#4 above), we can check the Residuals vs Fitted plot. It shows a reasonably random spread of the dots around the line, so this assumption also holds.
In other words, the Residuals vs Fitted plot confirms that the fitted model is appropriate, because the variability of the residuals does not increase with the fitted values.

The Residuals vs Leverage plot helps identify observations with an unusually high impact, such as outliers. It seems that there is no such outlier in this data.

more info about interpretations: https://vimeo.com/199820384
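Beyond eyeballing the Residuals vs Leverage plot, influential observations can be quantified with Cook's distance. A self-contained sketch with one deliberately planted influential point (not the course data):

```r
set.seed(1)
toy <- data.frame(x = c(1:10, 20),
                  y = c(2 * (1:10) + rnorm(10, sd = 0.1), 5))  # last point is far off the line
m <- lm(y ~ x, data = toy)
cd <- cooks.distance(m)
which(cd > 4 / length(cd))  # a common rule-of-thumb cutoff for influential points
```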

Extra coding 2: Box-plot

Let’s also check the residuals of the model using a boxplot: the closer they are to 0, the better the model. In other words, it shows the symmetry, and specifically the normality, of the error terms in the regression model.

boxplot(my_model4$residuals,
        main = "Residuals of the model",
        xlab = "Residual values",
        ylab = "",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = FALSE)

END of exercise 2: linear regression analysis

Thank you and all the best!




————————————————————————————

Mohammad Imangholiloo
20201112

Exercise 3: Logistic regression analysis

In this script we assess the Student Performance Data Set from https://archive.ics.uci.edu/ml/datasets/Student+Performance

Data description:
These data approach student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features; the data were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). More info: https://archive.ics.uci.edu/ml/datasets/Student+Performance

Part 1) Data wrangling

# Read the data to R
# Set working directory
setwd("C:/Users/Mohammad/Documents/IODS-project/data")

# read the math class questionnaire data into memory
student_mat <- read.csv("student-mat.csv", sep = ";", header = TRUE)

# read the Portuguese class questionnaire data into memory
student_por <- read.csv("student-por.csv", sep = ";", header = TRUE)

# check dimensions of both data
dim(student_mat)
## [1] 395  33
dim(student_por)
## [1] 649  33
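The two questionnaires share their background variables, and seeing which columns they have in common will matter later when joining the datasets; intersect() on the column names shows this directly. A self-contained sketch with tiny stand-in frames (not the real 33-column data):

```r
# toy stand-ins for student_mat and student_por
mat_toy <- data.frame(school = "GP", sex = "F", G3 = 6)
por_toy <- data.frame(school = "GP", sex = "F", G3 = 11, extra = 1)
intersect(colnames(mat_toy), colnames(por_toy))  # columns present in both
```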
# check structure of both data
str(student_mat)
## 'data.frame':    395 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
str(student_por)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...
# check the first rows of both data
head(student_mat)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        3       yes     no  yes         no
## 4   mother          1         3        0        no    yes  yes        yes
## 5   father          1         2        0        no    yes  yes         no
## 6   mother          1         2        0        no    yes  yes        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        6  5  6  6
## 2        4  5  5  6
## 3       10  7  8 10
## 4        2 15 14 15
## 5        4  6 10 10
## 6       10 15 15 15
head(student_por)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        0       yes     no   no         no
## 4   mother          1         3        0        no    yes   no        yes
## 5   father          1         2        0        no    yes   no         no
## 6   mother          1         2        0        no    yes   no        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        4  0 11 11
## 2        2  9 11 11
## 3        6 12 13 12
## 4        0 14 14 14
## 5        0 11 13 13
## 6        6 12 12 13
# check last 5 rows of both data
tail(student_mat)
##     school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob reason
## 390     MS   F  18       U     GT3       T    1    1    other    other course
## 391     MS   M  20       U     LE3       A    2    2 services services course
## 392     MS   M  17       U     LE3       T    3    1 services services course
## 393     MS   M  21       R     GT3       T    1    1    other    other course
## 394     MS   M  18       R     LE3       T    3    2 services    other course
## 395     MS   M  19       U     LE3       T    1    1    other  at_home course
##     guardian traveltime studytime failures schoolsup famsup paid activities
## 390   mother          2         2        1        no     no   no        yes
## 391    other          1         2        2        no    yes  yes         no
## 392   mother          2         1        0        no     no   no         no
## 393    other          1         1        3        no     no   no         no
## 394   mother          3         1        0        no     no   no         no
## 395   father          1         1        0        no     no   no         no
##     nursery higher internet romantic famrel freetime goout Dalc Walc health
## 390     yes    yes       no       no      1        1     1    1    1      5
## 391     yes    yes       no       no      5        5     4    4    5      4
## 392      no    yes      yes       no      2        4     5    3    4      2
## 393      no    yes       no       no      5        5     3    3    3      3
## 394      no    yes      yes       no      4        4     1    3    4      5
## 395     yes    yes      yes       no      3        2     3    3    3      5
##     absences G1 G2 G3
## 390        0  6  5  0
## 391       11  9  9  9
## 392        3 14 16 16
## 393        3 10  8  7
## 394        0 11 12 10
## 395        5  8  9  9
tail(student_por)
##     school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 644     MS   F  18       R     GT3       T    4    4  teacher  at_home
## 645     MS   F  19       R     GT3       T    2    3 services    other
## 646     MS   F  18       U     LE3       T    3    1  teacher services
## 647     MS   F  18       U     GT3       T    1    1    other    other
## 648     MS   M  17       U     LE3       T    3    1 services services
## 649     MS   M  18       R     LE3       T    3    2 services    other
##         reason guardian traveltime studytime failures schoolsup famsup paid
## 644 reputation   mother          3         1        0        no    yes   no
## 645     course   mother          1         3        1        no     no   no
## 646     course   mother          1         2        0        no    yes   no
## 647     course   mother          2         2        0        no     no   no
## 648     course   mother          2         1        0        no     no   no
## 649     course   mother          3         1        0        no     no   no
##     activities nursery higher internet romantic famrel freetime goout Dalc Walc
## 644        yes     yes    yes      yes      yes      4        4     3    2    2
## 645        yes      no    yes      yes       no      5        4     2    1    2
## 646         no     yes    yes      yes       no      4        3     4    1    1
## 647        yes     yes    yes       no       no      1        1     1    1    1
## 648         no      no    yes      yes       no      2        4     5    3    4
## 649         no      no    yes      yes       no      4        4     1    3    4
##     health absences G1 G2 G3
## 644      5        4  7  9 10
## 645      5        4 10 11 10
## 646      1        4 15 15 16
## 647      5        6 11 12  9
## 648      2        6 10 10 10
## 649      5        4 10 11 11
# a quick look at descriptive statistics of the columns, e.g. mean, min and max
summary(student_mat) 
##     school              sex                 age         address         
##  Length:395         Length:395         Min.   :15.0   Length:395        
##  Class :character   Class :character   1st Qu.:16.0   Class :character  
##  Mode  :character   Mode  :character   Median :17.0   Mode  :character  
##                                        Mean   :16.7                     
##                                        3rd Qu.:18.0                     
##                                        Max.   :22.0                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:395         Length:395         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :2.000  
##                                        Mean   :2.749   Mean   :2.522  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:395        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:395         Length:395         Length:395         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.944  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00
summary(student_por) 
##     school              sex                 age          address         
##  Length:649         Length:649         Min.   :15.00   Length:649        
##  Class :character   Class :character   1st Qu.:16.00   Class :character  
##  Mode  :character   Mode  :character   Median :17.00   Mode  :character  
##                                        Mean   :16.74                     
##                                        3rd Qu.:18.00                     
##                                        Max.   :22.00                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:649         Length:649         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :2.000   Median :2.000  
##                                        Mean   :2.515   Mean   :2.307  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:649        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.569   Mean   :1.931   Mean   :0.2219                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:649         Length:649         Length:649         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.931  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime        goout            Dalc            Walc          health     
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:2.000  
##  Median :3.00   Median :3.000   Median :1.000   Median :2.00   Median :4.000  
##  Mean   :3.18   Mean   :3.185   Mean   :1.502   Mean   :2.28   Mean   :3.536  
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##     absences            G1             G2              G3       
##  Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00  
##  Median : 2.000   Median :11.0   Median :11.00   Median :12.00  
##  Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91  
##  3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00
# print the column names to the console
colnames(student_mat) 
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"
colnames(student_por)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"
# show the number of observations per school
table(student_mat$school) 
## 
##  GP  MS 
## 349  46
table(student_por$school)
## 
##  GP  MS 
## 423 226

Merge the two datasets using the dplyr library

# load library
library(dplyr)

# make a list of common columns to be used as identifiers in both datasets when merging
join_by_columns <- c("school","sex","age","address","famsize","Pstatus","Medu",
                     "Fedu","Mjob","Fjob","reason","nursery","internet")

# join the two datasets by inner join function of the dplyr library, using the selected identifiers
# Note: inner_join keeps only rows (observations) that are in both input datasets
students_math_por_joint <- inner_join(student_mat, student_por, 
                                      by = join_by_columns, suffix = c("_math", "_por"))

# check dimensions of the newly merged dataset
dim(students_math_por_joint) # 382 students
## [1] 382  53
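As a side note, the inner-join semantics are easy to check on a toy pair of data frames; base R's merge() behaves like inner_join() here, because all = FALSE (an inner join) is its default. The data frames below are hypothetical:

```r
# two tiny data frames sharing an 'id' key; id 4 is only in df1, id 3 only in df2
df1 <- data.frame(id = c(1, 2, 4), score_math = c(10, 12, 15))
df2 <- data.frame(id = c(1, 2, 3), score_por  = c(11, 13, 14))

# inner join: keep only rows whose key appears in BOTH inputs
joined <- merge(df1, df2, by = "id")   # all = FALSE is the default
joined
##   id score_math score_por
## 1  1         10        11
## 2  2         12        13
```

Rows with ids 3 and 4 are dropped, which is why the merged student data has fewer rows (382) than either input.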
# see the new column names
colnames(students_math_por_joint)
##  [1] "school"          "sex"             "age"             "address"        
##  [5] "famsize"         "Pstatus"         "Medu"            "Fedu"           
##  [9] "Mjob"            "Fjob"            "reason"          "guardian_math"  
## [13] "traveltime_math" "studytime_math"  "failures_math"   "schoolsup_math" 
## [17] "famsup_math"     "paid_math"       "activities_math" "nursery"        
## [21] "higher_math"     "internet"        "romantic_math"   "famrel_math"    
## [25] "freetime_math"   "goout_math"      "Dalc_math"       "Walc_math"      
## [29] "health_math"     "absences_math"   "G1_math"         "G2_math"        
## [33] "G3_math"         "guardian_por"    "traveltime_por"  "studytime_por"  
## [37] "failures_por"    "schoolsup_por"   "famsup_por"      "paid_por"       
## [41] "activities_por"  "higher_por"      "romantic_por"    "famrel_por"     
## [45] "freetime_por"    "goout_por"       "Dalc_por"        "Walc_por"       
## [49] "health_por"      "absences_por"    "G1_por"          "G2_por"         
## [53] "G3_por"
# view the file in R
#View(students_math_por_joint)

# glimpse at the data: like str(), but shows more values; columns run down the page like a transposed print()
glimpse(students_math_por_joint)
## Rows: 382
## Columns: 53
## $ school          <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP...
## $ sex             <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "...
## $ age             <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address         <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize         <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "G...
## $ Pstatus         <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "...
## $ Medu            <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu            <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob            <chr> "at_home", "at_home", "at_home", "health", "other",...
## $ Fjob            <chr> "teacher", "other", "other", "services", "other", "...
## $ reason          <chr> "course", "course", "other", "home", "home", "reput...
## $ guardian_math   <chr> "mother", "father", "mother", "mother", "father", "...
## $ traveltime_math <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime_math  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures_math   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup_math  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", ...
## $ famsup_math     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes"...
## $ paid_math       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no",...
## $ activities_math <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "...
## $ nursery         <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "ye...
## $ higher_math     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ internet        <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no"...
## $ romantic_math   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "n...
## $ famrel_math     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime_math   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout_math      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc_math       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc_math       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health_math     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences_math   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6...
## $ G1_math         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, ...
## $ G2_math         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, ...
## $ G3_math         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11,...
## $ guardian_por    <chr> "mother", "father", "mother", "mother", "father", "...
## $ traveltime_por  <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime_por   <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures_por    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup_por   <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", ...
## $ famsup_por      <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes"...
## $ paid_por        <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no...
## $ activities_por  <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "...
## $ higher_por      <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ romantic_por    <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "n...
## $ famrel_por      <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime_por    <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout_por       <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc_por        <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc_por        <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health_por      <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences_por    <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10,...
## $ G1_por          <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 1...
## $ G2_por          <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13,...
## $ G3_por          <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12,...
# check data structure
str(students_math_por_joint)
## 'data.frame':    382 obs. of  53 variables:
##  $ school         : chr  "GP" "GP" "GP" "GP" ...
##  $ sex            : chr  "F" "F" "F" "F" ...
##  $ age            : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address        : chr  "U" "U" "U" "U" ...
##  $ famsize        : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus        : chr  "A" "T" "T" "T" ...
##  $ Medu           : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu           : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob           : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob           : chr  "teacher" "other" "other" "services" ...
##  $ reason         : chr  "course" "course" "other" "home" ...
##  $ guardian_math  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime_math: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime_math : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures_math  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup_math : chr  "yes" "no" "yes" "no" ...
##  $ famsup_math    : chr  "no" "yes" "no" "yes" ...
##  $ paid_math      : chr  "no" "no" "yes" "yes" ...
##  $ activities_math: chr  "no" "no" "no" "yes" ...
##  $ nursery        : chr  "yes" "no" "yes" "yes" ...
##  $ higher_math    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet       : chr  "no" "yes" "yes" "yes" ...
##  $ romantic_math  : chr  "no" "no" "no" "yes" ...
##  $ famrel_math    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime_math  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout_math     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc_math      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc_math      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health_math    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences_math  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1_math        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2_math        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3_math        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ guardian_por   : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime_por : int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime_por  : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures_por   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup_por  : chr  "yes" "no" "yes" "no" ...
##  $ famsup_por     : chr  "no" "yes" "no" "yes" ...
##  $ paid_por       : chr  "no" "no" "no" "no" ...
##  $ activities_por : chr  "no" "no" "no" "yes" ...
##  $ higher_por     : chr  "yes" "yes" "yes" "yes" ...
##  $ romantic_por   : chr  "no" "no" "no" "yes" ...
##  $ famrel_por     : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime_por   : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout_por      : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc_por       : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc_por       : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health_por     : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences_por   : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1_por         : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2_por         : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3_por         : int  11 11 12 14 13 13 13 13 17 13 ...

Combine the ‘duplicated’ answers in the joined data

# create a new data frame with only the joined columns
alc <- select(students_math_por_joint, one_of(join_by_columns))

# columns that were not used for joining the data
notjoined_columns <- colnames(student_mat)[!colnames(student_mat) %in% join_by_columns]

# print out the columns not used for joining
notjoined_columns
##  [1] "guardian"   "traveltime" "studytime"  "failures"   "schoolsup" 
##  [6] "famsup"     "paid"       "activities" "higher"     "romantic"  
## [11] "famrel"     "freetime"   "goout"      "Dalc"       "Walc"      
## [16] "health"     "absences"   "G1"         "G2"         "G3"
# for every column name not used for joining...
for(column_name in notjoined_columns) {
  # print the column name while looping to show progress
  print(column_name)
  # select two columns from 'math_por' with the same original name
  two_columns <- select(students_math_por_joint, starts_with(column_name))
  # select the first column vector of those two columns
  first_column <- select(two_columns, 1)[[1]]
  
  # if that first column vector is numeric...
  if(is.numeric(first_column)) {
    # take a rounded average of each row of the two columns and
    # add the resulting vector to the alc data frame
    alc[column_name] <- round(rowMeans(two_columns))
  } else { # else if it's not numeric...
    # add the first column vector to the alc data frame
    alc[column_name] <- first_column
  }
}
## [1] "guardian"
## [1] "traveltime"
## [1] "studytime"
## [1] "failures"
## [1] "schoolsup"
## [1] "famsup"
## [1] "paid"
## [1] "activities"
## [1] "higher"
## [1] "romantic"
## [1] "famrel"
## [1] "freetime"
## [1] "goout"
## [1] "Dalc"
## [1] "Walc"
## [1] "health"
## [1] "absences"
## [1] "G1"
## [1] "G2"
## [1] "G3"
# glimpse at the new combined data
glimpse(alc)
## Rows: 382
## Columns: 33
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel     <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime   <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout      <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc       <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc       <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health     <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9,...
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, ...
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15,...
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16...

Define a new column alc_use as the average of weekday and weekend alcohol use (columns Dalc and Walc, respectively)

alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)

# load library for plotting
library(ggplot2)

# Let's plot alcohol use by gender
g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))

# define the plot as a bar plot and draw it
g1 + geom_bar()

# create a new logical column 'high_use' that is TRUE when consumption is above 2
alc <- mutate(alc, high_use = alc_use > 2)
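The two derived columns can be sanity-checked on a handful of hypothetical ratings; a base-R sketch:

```r
# weekday and weekend consumption for three hypothetical students
Dalc <- c(1, 2, 4)
Walc <- c(1, 3, 5)

alc_use  <- (Dalc + Walc) / 2   # average of the two ratings
high_use <- alc_use > 2         # logical: TRUE above the threshold

alc_use
## [1] 1.0 2.5 4.5
high_use
## [1] FALSE  TRUE  TRUE
```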

# initialize a plot of 'high_use'
g2 <- ggplot(alc, aes(high_use))

# draw a bar plot of high_use by sex
g2 + facet_wrap("sex") + geom_bar()

EXTRA coding: Let's check it with a table as well

table(alc$high_use)
## 
## FALSE  TRUE 
##   268   114
"as it shows 114 students were consumed alcohol more than threshold (2) so were high use"
## [1] "as it shows 114 students were consumed alcohol more than threshold (2) so were high use"
table(alc$high_use)/nrow(alc)*100
## 
##    FALSE     TRUE 
## 70.15707 29.84293
"It means that 29.84% of students were high use, while majority (70.15%) were low use"
## [1] "It means that 29.84% of students were high use, while majority (70.15%) were low use"
table(alc$alc_use)/nrow(alc)*100
## 
##          1        1.5          2        2.5          3        3.5          4 
## 36.6492147 18.0628272 15.4450262 11.5183246  8.3769634  4.4502618  2.3560209 
##        4.5          5 
##  0.7853403  2.3560209
"as this shows, majoriy of them (36.65%) consumed just 1"
## [1] "as this shows, majoriy of them (36.65%) consumed just 1"
# Let's make a histogram and check the distribution
hist(alc$alc_use)

# Let's make a boxplot to further see the distribution
boxplot(alc$alc_use)

# read libraries
library(tidyr); library(dplyr); library(ggplot2)

# glimpse at the alc data
glimpse(alc) 
## Rows: 382
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel     <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime   <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout      <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc       <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc       <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health     <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9,...
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, ...
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15,...
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc) %>% glimpse
## Rows: 13,370
## Columns: 2
## $ key   <chr> "school", "school", "school", "school", "school", "school", "...
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "...
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

# produce summary statistics by group
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <chr> <lgl>    <int>      <dbl>
## 1 F     FALSE      156       11.4
## 2 F     TRUE        42       11.7
## 3 M     FALSE      112       12.2
## 4 M     TRUE        72       10.3

Let's create a few more plots, for example boxplots per group.

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade")

"So, as plots show, those who consumed alcohol highly, their grade reduced especially for males"
## [1] "So, as plots show, those who consumed alcohol highly, their grade reduced especially for males"
# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")

"As plots show, those who consumed alcohol highly, their absence were higher especially in males"
## [1] "As plots show, those who consumed alcohol highly, their absence were higher especially in males"
dim(alc)
## [1] 382  35
# Save files to your computer directory
write.csv(alc, "alc_joinet.csv") 
write.csv(alc, "students_math_por_joint.csv")



Part 2) Data analysis

Now that we have prepared the data, let's analyse it.

# read the data from your computer drive
data_student_perf_alc = read.csv("alc_joinet.csv")

#Alternatively you can read from the URL:
data_student_perf_alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header=TRUE)

Data description:
These data describe student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, collected through school reports and questionnaires. Two datasets cover performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). We joined the two datasets and are now ready for the analysis phase.
Thus, we will use this joint dataset to analyse the relationships between high/low alcohol consumption and other variables. Is there any relationship between student performance and alcohol use? Logistic regression will be used. ___For more info visit: https://archive.ics.uci.edu/ml/datasets/Student+Performance___
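For reference, the join of the two subject datasets can be sketched as below. The file names and join columns here are my assumptions based on the UCI page, not copied from the actual wrangling script:

```r
# sketch of joining the math and Portuguese datasets (files from the UCI
# Student Performance page; the join columns are assumed identifiers)
library(dplyr)

math <- read.table("student-mat.csv", sep = ";", header = TRUE)
por  <- read.table("student-por.csv", sep = ";", header = TRUE)

# students have no ID column, so background variables identify them
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")

math_por <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
```

Only students present in both files survive the inner join, which is why the joint data has fewer rows than either original file.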

# Variables names are as following
colnames(data_student_perf_alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

I would like to select four interesting variables to start modelling, with the following hypotheses:
1. internet: students with Internet access at home will consume less alcohol
2. goout: students who go out with friends more will consume more alcohol
3. absences: students who are often absent from class consume more alcohol
4. activities: students with more extra-curricular activities will consume less alcohol
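Before fitting any model, a quick descriptive check of these hypotheses is possible. This sketch assumes the joint data is loaded as `data_student_perf_alc` (as above) and dplyr is attached:

```r
# group means for the two yes/no predictors
library(dplyr)

data_student_perf_alc %>% group_by(internet)   %>% summarise(mean_alc = mean(alc_use))
data_student_perf_alc %>% group_by(activities) %>% summarise(mean_alc = mean(alc_use))

# correlations give a first impression for the numeric predictors
cor(data_student_perf_alc$goout,    data_student_perf_alc$alc_use)
cor(data_student_perf_alc$absences, data_student_perf_alc$alc_use)
```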


# select only those columns
variables <- c("internet", "goout", "absences", "activities", 
               "alc_use", "high_use")

dt_some_col <- select(data_student_perf_alc, all_of(variables))
dim(dt_some_col)
## [1] 382   6

Draw a bar plot of each variable

gather(dt_some_col) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

# see the proportional table
table(dt_some_col$absences)/nrow(dt_some_col)*100
## 
##          0          1          2          3          4          5          6 
## 30.3664921  0.7853403 17.5392670  1.5706806 13.3507853  1.0471204  7.8534031 
##          7          8          9         10         11         12         13 
##  2.0942408  5.4973822  0.7853403  4.4502618  0.5235602  2.6178010  0.7853403 
##         14         15         16         17         18         19         20 
##  2.8795812  0.7853403  1.5706806  0.2617801  1.0471204  0.2617801  0.5235602 
##         21         22         23         24         25         26         28 
##  0.2617801  0.7853403  0.2617801  0.2617801  0.2617801  0.2617801  0.2617801 
##         30         54         56         75 
##  0.2617801  0.2617801  0.2617801  0.2617801
#data structure
str(dt_some_col)
## 'data.frame':    382 obs. of  6 variables:
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
#Crosstable
cross_table = xtabs(~alc_use + internet, data = dt_some_col)
cross_table
##        internet
## alc_use  no yes
##     1    27 117
##     1.5   7  61
##     2     9  49
##     2.5   5  37
##     3     6  26
##     3.5   2  15
##     4     2   7
##     4.5   0   3
##     5     0   9

EXTRA coding: Calculate the proportion of each variable in table, rounding to 2 decimals

round(prop.table(cross_table), 2)
##        internet
## alc_use   no  yes
##     1   0.07 0.31
##     1.5 0.02 0.16
##     2   0.02 0.13
##     2.5 0.01 0.10
##     3   0.02 0.07
##     3.5 0.01 0.04
##     4   0.01 0.02
##     4.5 0.00 0.01
##     5   0.00 0.02
# shows whether the distribution differs significantly between groups
chisq.test(cross_table)
## Warning in chisq.test(cross_table): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  cross_table
## X-squared = 6.0051, df = 8, p-value = 0.6467

As the p-value > 0.05, the distribution of this variable (Internet access at home) does not differ significantly between alcohol-use levels.
Note: the warning appears because some expected cell counts are small, so the chi-squared approximation may be inaccurate.
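When that warning about the chi-squared approximation matters, a Monte-Carlo p-value avoids relying on the approximation. A sketch using the `cross_table` object computed above:

```r
# simulate the p-value instead of using the chi-squared approximation
chisq.test(cross_table, simulate.p.value = TRUE, B = 10000)

# Fisher's exact test is another option for sparse tables
fisher.test(cross_table, simulate.p.value = TRUE, B = 10000)
```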


The same for another variable:

#Crosstable
cross_table = xtabs(~alc_use + goout, data = dt_some_col)
cross_table
##        goout
## alc_use  1  2  3  4  5
##     1   16 49 45 25  9
##     1.5  3 22 29  8  6
##     2    2 13 26 10  7
##     2.5  2  7 13 16  4
##     3    0  4  3 14 11
##     3.5  1  2  4  3  7
##     4    0  0  1  5  3
##     4.5  0  1  0  1  1
##     5    0  1  2  0  6
# to get proportion of each, with 2 decimal rounded
round(prop.table(cross_table), 2)
##        goout
## alc_use    1    2    3    4    5
##     1   0.04 0.13 0.12 0.07 0.02
##     1.5 0.01 0.06 0.08 0.02 0.02
##     2   0.01 0.03 0.07 0.03 0.02
##     2.5 0.01 0.02 0.03 0.04 0.01
##     3   0.00 0.01 0.01 0.04 0.03
##     3.5 0.00 0.01 0.01 0.01 0.02
##     4   0.00 0.00 0.00 0.01 0.01
##     4.5 0.00 0.00 0.00 0.00 0.00
##     5   0.00 0.00 0.01 0.00 0.02
# whether the distribution differs significantly
chisq.test(cross_table)
## Warning in chisq.test(cross_table): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  cross_table
## X-squared = 108.14, df = 32, p-value = 3.414e-10

As the p-value < 0.05, the distribution of this variable (going-out behaviour) differs significantly.
Note: the warning appears because some expected cell counts are small.


Yet, for another variable

#Crosstable
cross_table = xtabs(~alc_use + absences, data = dt_some_col)
cross_table
##        absences
## alc_use  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
##     1   58  2 28  0 17  0 11  3  8  2  5  0  3  0  1  1  1  0  1  0  1  0  0  0
##     1.5 22  0 16  0 11  1  4  0  2  1  4  0  2  0  2  0  0  0  1  0  0  0  0  1
##     2   14  0 10  2  8  3  4  3  3  0  3  1  1  0  4  0  1  0  0  0  0  1  0  0
##     2.5  7  1  5  2  4  0  5  2  1  0  2  1  3  0  1  0  1  1  1  0  1  0  2  0
##     3    8  0  7  0  5  0  1  0  3  0  1  0  0  1  0  2  1  0  0  0  0  0  1  0
##     3.5  5  0  1  1  2  0  1  0  2  0  1  0  0  1  1  0  0  0  0  1  0  0  0  0
##     4    1  0  0  1  2  0  3  0  1  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
##     4.5  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  0  0  0  0  0  0  0  0  0
##     5    1  0  0  0  2  0  1  0  1  0  1  0  0  0  1  0  2  0  0  0  0  0  0  0
##        absences
## alc_use 24 25 26 28 30 54 56 75
##     1    0  0  1  0  0  0  0  1
##     1.5  0  1  0  0  0  0  0  0
##     2    0  0  0  0  0  0  0  0
##     2.5  1  0  0  0  0  0  1  0
##     3    0  0  0  0  1  1  0  0
##     3.5  0  0  0  1  0  0  0  0
##     4    0  0  0  0  0  0  0  0
##     4.5  0  0  0  0  0  0  0  0
##     5    0  0  0  0  0  0  0  0
# to get proportion of each, with 2 decimal rounded
round(prop.table(cross_table), 2)
##        absences
## alc_use    0    1    2    3    4    5    6    7    8    9   10   11   12   13
##     1   0.15 0.01 0.07 0.00 0.04 0.00 0.03 0.01 0.02 0.01 0.01 0.00 0.01 0.00
##     1.5 0.06 0.00 0.04 0.00 0.03 0.00 0.01 0.00 0.01 0.00 0.01 0.00 0.01 0.00
##     2   0.04 0.00 0.03 0.01 0.02 0.01 0.01 0.01 0.01 0.00 0.01 0.00 0.00 0.00
##     2.5 0.02 0.00 0.01 0.01 0.01 0.00 0.01 0.01 0.00 0.00 0.01 0.00 0.01 0.00
##     3   0.02 0.00 0.02 0.00 0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00
##     3.5 0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00
##     4   0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##     4.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##     5   0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        absences
## alc_use   14   15   16   17   18   19   20   21   22   23   24   25   26   28
##     1   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##     1.5 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##     2   0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##     2.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00
##     3   0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##     3.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##     4   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##     4.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##     5   0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        absences
## alc_use   30   54   56   75
##     1   0.00 0.00 0.00 0.00
##     1.5 0.00 0.00 0.00 0.00
##     2   0.00 0.00 0.00 0.00
##     2.5 0.00 0.00 0.00 0.00
##     3   0.00 0.00 0.00 0.00
##     3.5 0.00 0.00 0.00 0.00
##     4   0.00 0.00 0.00 0.00
##     4.5 0.00 0.00 0.00 0.00
##     5   0.00 0.00 0.00 0.00
# whether the distribution differs significantly
chisq.test(cross_table)
## Warning in chisq.test(cross_table): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  cross_table
## X-squared = 348.17, df = 248, p-value = 2.751e-05

As p-value < 0.05, thus the distribution of this variable (absence) is significantly different.
Note: the warning appears because some expected cell counts are small.



And for one more variable:

#Crosstable
cross_table = xtabs(~alc_use + activities, data = dt_some_col)
cross_table
##        activities
## alc_use no yes
##     1   58  86
##     1.5 40  28
##     2   24  34
##     2.5 22  20
##     3   18  14
##     3.5 10   7
##     4    4   5
##     4.5  1   2
##     5    4   5
# to get proportion of each, with 2 decimal rounded
round(prop.table(cross_table), 2)
##        activities
## alc_use   no  yes
##     1   0.15 0.23
##     1.5 0.10 0.07
##     2   0.06 0.09
##     2.5 0.06 0.05
##     3   0.05 0.04
##     3.5 0.03 0.02
##     4   0.01 0.01
##     4.5 0.00 0.01
##     5   0.01 0.01
# whether the distribution differs significantly
chisq.test(cross_table)
## Warning in chisq.test(cross_table): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  cross_table
## X-squared = 9.9466, df = 8, p-value = 0.2688

As the p-value > 0.05, the distribution of this variable (extra-curricular activities) does not differ significantly.
Note: the warning appears because some expected cell counts are small.


Let’s barplot the variables of interest

barplot(sort(table(dt_some_col$internet),decreasing=T))

barplot(sort(table(dt_some_col$goout),decreasing=T))

barplot(sort(table(dt_some_col$absences),decreasing=T))

barplot(sort(table(dt_some_col$activities),decreasing=T))

barplot(sort(table(dt_some_col$alc_use),decreasing=T))

barplot(sort(table(dt_some_col$high_use),decreasing=T))


Let’s boxplot the numerical variables of interest

boxplot(dt_some_col$goout)

boxplot(dt_some_col$absences)

boxplot(dt_some_col$alc_use)

And some more

plot(alc_use ~ goout, data=dt_some_col)

plot(alc_use ~ absences, data=dt_some_col)

Alternatively, the same cross-tabulation can be produced with `table(dt_some_col$activities, dt_some_col$alc_use)`.



## Apply logistic regression model

#set binary variables as factor
dt_some_col$internet <- factor(dt_some_col$internet)
dt_some_col$activities <- factor(dt_some_col$activities)


# make the model with glm()
regres_model <- glm(high_use ~ internet + goout + absences + activities, data = dt_some_col, family = "binomial")

# print out a summary of the model
summary(regres_model)
## 
## Call:
## glm(formula = high_use ~ internet + goout + absences + activities, 
##     family = "binomial", data = dt_some_col)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0141  -0.7760  -0.5282   0.9018   2.3876  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.47384    0.51472  -6.749 1.49e-11 ***
## internetyes   -0.08489    0.35507  -0.239 0.811041    
## goout          0.76811    0.11960   6.422 1.34e-10 ***
## absences       0.06175    0.01701   3.629 0.000284 ***
## activitiesyes -0.44819    0.25086  -1.787 0.074000 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 393.79  on 377  degrees of freedom
## AIC: 403.79
## 
## Number of Fisher Scoring iterations: 4

Based on the model summary, absences have a statistically significant relationship with high alcohol use: Pr(>|z|) = 0.000284, so the coefficient is significant at the 99% confidence level.
Students who go out more consumed more alcohol; with Pr(>|z|) = 1.34e-10 this hypothesis is also supported at the 99% confidence level.
However, Internet access at home showed no significant relationship with high alcohol use: its Pr(>|z|) is > 0.05, so it is not significant even at the 95% level, and our hypothesis about Internet access is rejected.
This matches the conclusion of the chisq.test above!
Notably, the relationship between extra-curricular activities and alcohol consumption is not significant at the 95% level, only at the 90% level (p = 0.074), so we reject that hypothesis and can drop this variable.

Hence, keeping only the clearly significant predictors, the fitted model is approximately:
log-odds(high_use) = 0.76811 * goout + 0.06175 * absences - 3.47384
(following the general form y = a*x1 + b*x2 + c).
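The linear combination from this logistic model is on the log-odds scale; `plogis()` (the inverse logit) converts it to a probability. A sketch with the coefficients copied from the summary above and a hypothetical student:

```r
# coefficients copied from the model summary above
b0 <- -3.47384; b_goout <- 0.76811; b_abs <- 0.06175

# hypothetical student: goes out a lot (goout = 5), 10 absences
log_odds <- b0 + b_goout * 5 + b_abs * 10
plogis(log_odds)  # probability of high alcohol use, about 0.73
```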

# print out the coefficients of the model
coef(regres_model)
##   (Intercept)   internetyes         goout      absences activitiesyes 
##   -3.47383750   -0.08489077    0.76810971    0.06174629   -0.44819206
# compute odds ratios (OR)
OR <- coef(regres_model) %>% exp

# compute confidence intervals (CI)
CI <- confint(regres_model) %>% exp
## Waiting for profiling to be done...
# bind odds ratios (OR) and confidence intervals (CI) columns together
cbind(OR, CI)
##                       OR      2.5 %     97.5 %
## (Intercept)   0.03099785 0.01084108 0.08194402
## internetyes   0.91861262 0.46457217 1.88377842
## goout         2.15568752 1.71547363 2.74450680
## absences      1.06369244 1.03125278 1.10286289
## activitiesyes 0.63878199 0.38885454 1.04178158

As the table shows, going out and absences have odds ratios clearly above 1; an odds ratio (unlike a coefficient in a non-logistic regression) multiplies the odds of high use per unit increase in the predictor. Their confidence intervals are fairly narrow and exclude 1, which supports a real association; for internet and activities the intervals include 1, consistent with the non-significant coefficients.
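To make the odds-ratio interpretation concrete (coefficient value copied from the model summary above):

```r
# each one-point increase on the goout scale multiplies the odds of
# high alcohol use by exp(coefficient)
exp(0.76811)      # about 2.16 per unit of goout
exp(0.76811 * 2)  # odds ratios multiply: about 4.65 per two units
```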

# predict() the probability of high_use
probabilities <- predict(regres_model, type = "response")

# add the predicted probabilities to 'dt_some_col'
dt_some_col <- mutate(dt_some_col, probability = probabilities)

# use the probabilities to make a prediction of high_use
dt_some_col <- mutate(dt_some_col, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(dt_some_col, internet, goout, absences, activities, probability, prediction) %>% tail(10)
##     internet goout absences activities probability prediction
## 373       no     2        0         no  0.12590977      FALSE
## 374       no     3       14         no  0.42432092      FALSE
## 375       no     3        2         no  0.25999091      FALSE
## 376      yes     3        7        yes  0.21919441      FALSE
## 377      yes     2        0        yes  0.07793785      FALSE
## 378      yes     4        0         no  0.38076807      FALSE
## 379       no     1        0        yes  0.04093710      FALSE
## 380       no     1        0        yes  0.04093710      FALSE
## 381      yes     5        3         no  0.61468746       TRUE
## 382      yes     1        0         no  0.05783324      FALSE
# tabulate the target variable versus the predictions
table(high_use = dt_some_col$high_use, prediction = dt_some_col$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   247   23
##    TRUE     71   41

Now, a 2x2 cross-tabulation of predictions versus the actual values.
As it shows, 247 students were correctly predicted as not high-use and 41 as high-use, while 71 high-use students were misclassified as not high-use and 23 the other way around.
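From that table the usual summary rates can be computed directly; a sketch using the counts above:

```r
# counts copied from the 2x2 table above (rows: actual, columns: predicted)
conf <- matrix(c(247, 71, 23, 41), nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))

accuracy    <- sum(diag(conf)) / sum(conf)                    # (247+41)/382
sensitivity <- conf["TRUE", "TRUE"]   / sum(conf["TRUE", ])   # 41/112
specificity <- conf["FALSE", "FALSE"] / sum(conf["FALSE", ])  # 247/270
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```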



# EXTRA coding: Create a nice graph with confusion matrix

# insprited from https://stackoverflow.com/a/64539733
library('caret')
## Loading required package: lattice
cm <- confusionMatrix(factor(dt_some_col$prediction), factor(dt_some_col$high_use), dnn = c("Prediction", "Reference"))

ggplot(as.data.frame(cm$table), aes(Prediction,sort(Reference,decreasing = T), fill= Freq)) +
  geom_tile() + geom_text(aes(label=Freq)) +
  scale_fill_gradient(low="white", high="#009193") +
  labs(x = "Prediction",y = "Reference") +
  scale_x_discrete(labels=c("False","True")) +
  scale_y_discrete(labels=c("True","False"))

Good plot!
As explained above, 247 non-high-use cases were predicted correctly as FALSE, while 71 high-use cases were misclassified.

# initialize a plot of 'high_use' versus 'probability' in 'dt_some_col'
g <- ggplot(dt_some_col, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = dt_some_col$high_use, prediction = dt_some_col$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64659686 0.06020942 0.70680628
##    TRUE  0.18586387 0.10732984 0.29319372
##    Sum   0.83246073 0.16753927 1.00000000

Define a loss function (average prediction error)

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
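A toy call confirms what this loss function measures: the share of observations where the predicted probability falls on the wrong side of 0.5.

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# two of the four toy predictions fall on the wrong side of 0.5
loss_func(class = c(1, 1, 0, 0), prob = c(0.9, 0.2, 0.1, 0.6))  # 0.5
```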

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = dt_some_col$high_use, prob = dt_some_col$probability)
## [1] 0.2460733

Bonus: K-fold cross-validation

Perform K-fold cross-validation and print out the mean prediction error for the testing data. The method divides the observations into K folds, uses one fold for validation and the remaining folds for training, and repeats this over all folds; for example, with K = 5 it trains on four folds and validates on one.

#load library
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
# cv.glm function from the 'boot' library computes the error and stores it in delta
cv <- cv.glm(data = dt_some_col, cost = loss_func, glmfit = regres_model, K = 2)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2617801

EXTRA coding: Make a loop to try different k values

k = c(2,3,4,5,6,7,8,9,10,11,12,13,14,15,20,25)
for (i in k){
  print(i)
  cv <- cv.glm(data = dt_some_col, cost = loss_func, glmfit = regres_model, K = i)
  # average number of wrong predictions in the cross validation
  print (cv$delta[1])}
## [1] 2
## [1] 0.2591623
## [1] 3
## [1] 0.2539267
## [1] 4
## [1] 0.2696335
## [1] 5
## [1] 0.2565445
## [1] 6
## [1] 0.2486911
## [1] 7
## [1] 0.2591623
## [1] 8
## [1] 0.2434555
## [1] 9
## [1] 0.2539267
## [1] 10
## [1] 0.2539267
## [1] 11
## [1] 0.2617801
## [1] 12
## [1] 0.2539267
## [1] 13
## [1] 0.2696335
## [1] 14
## [1] 0.2513089
## [1] 15
## [1] 0.2539267
## [1] 20
## [1] 0.2539267
## [1] 25
## [1] 0.2591623

So, across the different K values the average number of wrong predictions (cv$delta[1] above) fluctuates between roughly 0.24 and 0.27, close to the training error of 0.246. Note that cross-validation estimates the model's generalisation error; varying K changes the estimate, not the model itself.



## Super-Bonus: Create a logistic regression with other variables
I would like to add two more variables: romantic (whether or not the student is in a romantic relationship) and famrel (the quality of family relationships).
I keep the previously significant variables.

colnames(data_student_perf_alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
variables <- c("romantic", "goout", "absences", "famrel", 
               "alc_use", "high_use")
dt_some_col =select(data_student_perf_alc, variables)
dim(dt_some_col)
## [1] 382   6
# draw a bar plot of each variable
gather(dt_some_col) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

table(dt_some_col$absences)/nrow(dt_some_col)*100
## 
##          0          1          2          3          4          5          6 
## 30.3664921  0.7853403 17.5392670  1.5706806 13.3507853  1.0471204  7.8534031 
##          7          8          9         10         11         12         13 
##  2.0942408  5.4973822  0.7853403  4.4502618  0.5235602  2.6178010  0.7853403 
##         14         15         16         17         18         19         20 
##  2.8795812  0.7853403  1.5706806  0.2617801  1.0471204  0.2617801  0.5235602 
##         21         22         23         24         25         26         28 
##  0.2617801  0.7853403  0.2617801  0.2617801  0.2617801  0.2617801  0.2617801 
##         30         54         56         75 
##  0.2617801  0.2617801  0.2617801  0.2617801
barplot(sort(table(dt_some_col$romantic),decreasing=T))

barplot(sort(table(dt_some_col$goout),decreasing=T))

barplot(sort(table(dt_some_col$absences),decreasing=T))

barplot(sort(table(dt_some_col$famrel),decreasing=T))

barplot(sort(table(dt_some_col$alc_use),decreasing=T))

barplot(sort(table(dt_some_col$high_use),decreasing=T))

#set binary variables as factor
dt_some_col$romantic <- factor(dt_some_col$romantic)

# find the model with glm()


regres_model <- glm(high_use ~ romantic + goout + absences + famrel, data = dt_some_col, family = "binomial")

# print out a summary of the model
summary(regres_model)
## 
## Call:
## glm(formula = high_use ~ romantic + goout + absences + famrel, 
##     family = "binomial", data = dt_some_col)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9599  -0.7789  -0.5080   0.8454   2.4681  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.15491    0.63232  -3.408 0.000655 ***
## romanticyes -0.37335    0.27759  -1.345 0.178641    
## goout        0.79978    0.12171   6.571 4.99e-11 ***
## absences     0.06001    0.01641   3.656 0.000256 ***
## famrel      -0.41049    0.13511  -3.038 0.002380 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 386.46  on 377  degrees of freedom
## AIC: 396.46
## 
## Number of Fisher Scoring iterations: 4

Based on the model summary, the quality of family relationships (famrel) has a statistically significant negative relationship with high alcohol use: Pr(>|z|) = 0.002380, significant at the 99% confidence level.

Students who go out more consumed more alcohol; with Pr(>|z|) = 4.99e-11 this hypothesis is again supported at the 99% confidence level. The romantic-relationship variable, however, is not significant (p = 0.179).
Hence, the fitted model is approximately:
log-odds(high_use) = 0.79978 * goout + 0.06001 * absences - 0.37335 * romanticyes - 0.41049 * famrel - 2.15491
(following the general form y = a*x1 + b*x2 + c*x3 + d*x4 + e).

# print out the coefficients of the model
coef(regres_model)
## (Intercept) romanticyes       goout    absences      famrel 
## -2.15490562 -0.37335072  0.79978011  0.06000845 -0.41048560
# compute odds ratios (OR)
OR <- coef(regres_model) %>% exp

# compute confidence intervals (CI)
CI <- confint(regres_model) %>% exp
## Waiting for profiling to be done...
# bind odds ratios (OR) and confidence intervals (CI) columns together
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1159141 0.03245399 0.3900847
## romanticyes 0.6884237 0.39504416 1.1763720
## goout       2.2250516 1.76429361 2.8462267
## absences    1.0618455 1.03000291 1.0997256
## famrel      0.6633281 0.50735265 0.8632692
# predict() the probability of high_use
probabilities <- predict(regres_model, type = "response")

# add the predicted probabilities to 'dt_some_col'
dt_some_col <- mutate(dt_some_col, probability = probabilities)

# use the probabilities to make a prediction of high_use
dt_some_col <- mutate(dt_some_col, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(dt_some_col, romantic , goout , absences , famrel, probability, prediction) %>% tail(10)
##     romantic goout absences famrel probability prediction
## 373       no     2        0      4  0.09999431      FALSE
## 374       no     3       14      5  0.27530426      FALSE
## 375       no     3        2      5  0.15604215      FALSE
## 376      yes     3        7      4  0.20573973      FALSE
## 377       no     2        0      5  0.06863981      FALSE
## 378       no     4        0      4  0.35486376      FALSE
## 379       no     1        0      1  0.14608898      FALSE
## 380       no     1        0      1  0.14608898      FALSE
## 381       no     5        3      2  0.76906675       TRUE
## 382       no     1        0      4  0.04755851      FALSE
# tabulate the target variable versus the predictions
table(high_use = dt_some_col$high_use, prediction = dt_some_col$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   251   19
##    TRUE     66   46

As shown above, 251 non-high-use cases were predicted correctly as FALSE and 46 high-use cases as TRUE, while 66 high-use cases were misclassified, so the accuracy improved.
Moreover, both diagonal counts (251 and 46) increased compared to the previous model, which shows more correct predictions and hence higher model accuracy.
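Putting the two confusion matrices side by side (counts copied from the tables above):

```r
# accuracy = correct predictions / all observations
model1_acc <- (247 + 41) / 382  # internet + goout + absences + activities
model2_acc <- (251 + 46) / 382  # romantic + goout + absences + famrel
round(c(model1 = model1_acc, model2 = model2_acc), 3)  # 0.754 vs 0.777
```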

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = dt_some_col$high_use, prob = dt_some_col$probability)
## [1] 0.2225131

K-fold cross-validation
As explained above, let's perform K-fold cross-validation and print out the mean prediction error for the testing data.
It divides the observations into K folds, uses one fold for validation and the rest for training; for example, with K = 5 it trains on four folds and validates on one.

#load library
library(boot)
# cv.glm function from the 'boot' library computes the error and stores it in delta
cv <- cv.glm(data = dt_some_col, cost = loss_func, glmfit = regres_model, K = 2)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2251309

EXTRA coding: Try different k values using a loop for new model

k = c(2,3,4,5,6,7,8,9,10,11,12,13,14,15,20,25)
for (i in k){
  print(i)
  cv <- cv.glm(data = dt_some_col, cost = loss_func, glmfit = regres_model, K = i)
  # average number of wrong predictions in the cross validation
  print (cv$delta[1])}
## [1] 2
## [1] 0.2198953
## [1] 3
## [1] 0.2434555
## [1] 4
## [1] 0.2382199
## [1] 5
## [1] 0.2303665
## [1] 6
## [1] 0.2251309
## [1] 7
## [1] 0.2277487
## [1] 8
## [1] 0.2303665
## [1] 9
## [1] 0.2277487
## [1] 10
## [1] 0.2277487
## [1] 11
## [1] 0.2251309
## [1] 12
## [1] 0.2329843
## [1] 13
## [1] 0.2277487
## [1] 14
## [1] 0.2303665
## [1] 15
## [1] 0.2382199
## [1] 20
## [1] 0.2251309
## [1] 25
## [1] 0.2277487

So, the cross-validation error improved after adding the new variable (quality of family relationships): the average number of wrong predictions (cv$delta[1] above) decreased from around 0.25 to around 0.22-0.24.



————————————————————————————

Exercise 4: Clustering and classification

Part 1) Data wrangling

#load the required library
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
#load the data
data("Boston")


Data description:
This dataset is about housing values in suburbs of Boston. More info can be found at: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

#explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#summary of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The summary shows the minimum, first quartile, median, mean, third quartile and maximum for each column of the dataset

# check dimensions of the data
dim(Boston)
## [1] 506  14

As it shows, this dataset has 506 rows (observations) and 14 columns (variables)

# See head (first six rows) of the file
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# See tail (last six rows) of the file
tail(Boston)
##        crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 501 0.22438  0  9.69    0 0.585 6.027 79.7 2.4982   6 391    19.2 396.90 14.33
## 502 0.06263  0 11.93    0 0.573 6.593 69.1 2.4786   1 273    21.0 391.99  9.67
## 503 0.04527  0 11.93    0 0.573 6.120 76.7 2.2875   1 273    21.0 396.90  9.08
## 504 0.06076  0 11.93    0 0.573 6.976 91.0 2.1675   1 273    21.0 396.90  5.64
## 505 0.10959  0 11.93    0 0.573 6.794 89.3 2.3889   1 273    21.0 393.45  6.48
## 506 0.04741  0 11.93    0 0.573 6.030 80.8 2.5050   1 273    21.0 396.90  7.88
##     medv
## 501 16.8
## 502 22.4
## 503 20.6
## 504 23.9
## 505 22.0
## 506 11.9
#see column names
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
# plot matrix of the variables
pairs(Boston)

You can see pairwise plots of each variable (column) against every other, and how each variable is distributed.

library(ggplot2)
library(GGally)
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p

Some more plots showing the correlation of each column (variable) with the others, and the distributions.

Let’s also make a boxplot to visually see the distribution (min, max, quartiles and median) of each variable

boxplot(Boston,
        main = "Boxplot of all columns of Boston data",
        xlab = "variable name",
        ylab = "",
        col = "orange",
        border = "brown",
        horizontal = F,
        notch = F)

As the graph shows, the column “tax” has the highest variation, followed by “black”. The values in the other columns vary considerably less!

Let’s see the structure of the data even better using glimpse

# read libraries
library(tidyr); library(dplyr); library(ggplot2)

# glimpse at the Boston data
glimpse(Boston) 
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(Boston) %>% glimpse
## Rows: 7,084
## Columns: 2
## $ key   <chr> "crim", "crim", "crim", "crim", "crim", "crim", "crim", "crim...
## $ value <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829...
# draw a bar plot of each variable
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

The graph shows a bar plot of each column.

Let’s investigate possible correlations within the data: calculate the correlation matrix and round it.

cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00

Extra coding: make publication-ready interactive tables in R

Inspired by http://www.htmlwidgets.org/showcase_datatables.html

library('DT')
datatable(cor_matrix, options = list(pageLength = 10))

Or alternatively, set the page length option to the number of rows of the input data frame.

datatable(cor_matrix, options = list(nrow(cor_matrix)))

visualize the correlation matrix

library("corrplot")
## corrplot 0.84 loaded
corrplot(cor_matrix, method="circle", type="upper", 
         cl.pos="b", tl.pos="d", tl.cex = 0.7)

The graph shows the correlations between variables: the larger the circle (and the darker the color), the stronger the correlation, and the color shows whether it is positive or negative. For example, the correlation between nox and dis is strong and negative, i.e. as one variable increases, the other decreases. Another example is rad and tax, which have a strong positive correlation.
On the other hand, chas does not correlate well with the other variables.

Task 4: Let’s standardize the variables using the scale() function

boston_scaled <- scale(Boston)

As center = TRUE by default, scaling centers the data by subtracting the column means (omitting NAs) from the corresponding columns; with scale = TRUE (also the default), each centered column is then divided by its standard deviation.
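A quick sanity check on a toy matrix (made-up values, not the Boston data) confirms what scale() does:

```r
# Toy check: scale() = subtract column means, then divide by column sds
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
manual <- sweep(sweep(m, 2, colMeans(m), "-"), 2, apply(m, 2, sd), "/")
all.equal(as.vector(scale(m)), as.vector(manual))  # TRUE
```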

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

So, the variables were standardized as explained above: the mean of each column was shifted to 0 and the other values were scaled accordingly, so every column now has mean 0 (and standard deviation 1).

# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"

As you can see, the object’s class is “matrix” “array”, so we will change it to a data frame.

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
head(boston_scaled)
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582
boxplot(boston_scaled,
        main = "Boxplot of all columns of scaled Boston data",
        xlab = "variable name",
        ylab = "",
        col = "orange",
        border = "brown",
        horizontal = F,
        notch = F)

So, now the distribution of the data is shown much better. Yaaay.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110


As instructed, we shall now use a quantile vector of the column crim to create a categorical crime-rate variable in the dataset.
So, let’s start.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

You can see the quantiles of the column crim; for example, the minimum is -0.419366929 and the first quartile is -0.410563278.
Now we can categorise the continuous (float) values based on these quantiles of the data (as instructed).

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

cut() divides the range of x into intervals and codes the values in x according to which interval they fall into.
Now, we labelled the crime rate as “low”, “med_low”, “med_high” or “high” based on the quantile values.
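The same quantile-based binning on a toy vector (made up for illustration) makes the interval coding concrete:

```r
# Toy example: bin the values 1..10 by their own quartiles
x <- 1:10
b <- quantile(x)
g <- cut(x, breaks = b, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(g)  # roughly a quarter of the values per class
```

include.lowest = TRUE keeps the minimum value inside the first interval instead of dropping it as NA.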

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

Now you can see that 127 observations were categorised as low, 126 as med_low, 126 as med_high and 127 as high.

#we can even plot them
plot(crime)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows as train dataset
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set by selecting those observations (rows) that were not in the training data (using -ind)
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Task 5: Fit a linear discriminant analysis model on the train set.

lda.fit <- lda(crime ~ ., data = train)

This uses all variables (columns) of the train dataset to predict crime (the categorised column).

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2623762 0.2500000 0.2252475 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.00213355 -0.9476086 -0.08661679 -0.9041558  0.52002011 -0.8890870
## med_low  -0.06939036 -0.3460526 -0.04947434 -0.5585572 -0.08743919 -0.3446098
## med_high -0.37516530  0.1519498  0.23442641  0.3614941  0.09707591  0.4047910
## high     -0.48724019  1.0149946  0.03052480  1.0362149 -0.38478945  0.8020093
##                 dis        rad        tax    ptratio      black       lstat
## low       0.8729044 -0.6795860 -0.7616114 -0.5420266  0.3752771 -0.78955375
## med_low   0.3640106 -0.5474040 -0.4944960 -0.0513611  0.3215274 -0.17933054
## med_high -0.3579250 -0.4269683 -0.3336194 -0.2854159  0.1191139  0.01820543
## high     -0.8482471  1.6596029  1.5294129  0.8057784 -0.7825527  0.85321640
##                medv
## low       0.6296306
## med_low   0.0266927
## med_high  0.2163688
## high     -0.6747052
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.113474382  0.650276513 -0.92252431
## indus    0.019255146 -0.233436410  0.23213490
## chas     0.003839934 -0.003850361  0.05327591
## nox      0.323321419 -0.878941245 -1.30928984
## rm       0.018385371 -0.054221609 -0.12158781
## age      0.187841746 -0.303487784 -0.12053016
## dis     -0.117882456 -0.307287364  0.14902817
## rad      3.570068793  1.051832079 -0.11658404
## tax      0.074081418 -0.042048791  0.57993027
## ptratio  0.175649970 -0.078662809 -0.19825941
## black   -0.170524837  0.042023588  0.17093377
## lstat    0.164648512 -0.241030885  0.26574105
## medv     0.078911213 -0.438437111 -0.28925299
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9558 0.0344 0.0098
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Predict:

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      10        1    0
##   med_low    1      15        4    0
##   med_high   0       9       14    2
##   high       0       0        1   35

EXTRA coding 1: Create a nice graph of the confusion matrix

Inspired by https://stackoverflow.com/a/64539733

library('caret')
cm <- confusionMatrix(factor(lda.pred$class), factor(correct_classes), dnn = c("Prediction", "Reference"))

ggplot(as.data.frame(cm$table), aes(Prediction,sort(Reference,decreasing = T), fill= Freq)) +
  geom_tile() + geom_text(aes(label=Freq)) +
  scale_fill_gradient(low="white", high="#009193") +
  labs(x = "Reference",y = "Prediction") +
  scale_x_discrete(labels=c('low','med_low','med_high','high')) +
  scale_y_discrete(labels=c('high', 'med_high', 'med_low', 'low'))

EXTRA coding 2: Calculate overall accuracy

Overall accuracy is the number of correctly classified labels divided by the total number of predictions.
Code inspired by https://stackoverflow.com/a/24349171
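On a toy pair of label vectors (invented for illustration, not the lda output), accuracy is simply the share of matches:

```r
# Toy example: accuracy = correctly classified / total predictions
pred <- c("low", "high", "high", "low")
ref  <- c("low", "high", "low",  "low")
accuracy <- mean(pred == ref)
accuracy  # 0.75: 3 of the 4 predictions match the reference
```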

all_accuracies <- cm$overall
all_accuracies
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   7.254902e-01   6.272027e-01   6.282394e-01   8.092144e-01   3.529412e-01 
## AccuracyPValue  McnemarPValue 
##   2.056819e-14            NaN
all_accuracies['Accuracy']*100
## Accuracy 
## 72.54902
all_accuracies['Kappa']*100
##    Kappa 
## 62.72027

The confusion matrix (graph) shows that, for example, 35 observations of class high were predicted correctly as high, while one was misclassified as med_high.
As the confusion matrix shows, most of the high observations were classified as high.

Task 7: Reload the dataset and calculate the distances between the observations.

# load MASS and Boston
library(MASS)
data('Boston')

# euclidean distance matrix
dist_eu <- dist(Boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.624 170.539 226.315 371.950 626.047
# manhattan distance matrix
dist_man <- dist(Boston, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.016  149.145  279.505  342.899  509.707 1198.265

As you can see, the mean Manhattan distance between observations is 342.90 (with 2.02 and 1198.27 as min and max, respectively).
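On a toy pair of points (arbitrary coordinates), dist() matches the textbook formulas for both metrics:

```r
# Toy check of the two distance metrics on points (0, 3) and (4, 7)
m <- rbind(c(0, 3), c(4, 7))
as.numeric(dist(m))                        # Euclidean: sqrt(4^2 + 4^2) ~ 5.66
as.numeric(dist(m, method = "manhattan"))  # Manhattan: |4| + |4| = 8
```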

Part 2) k-means clustering

Let’s cluster the dataset into 3 classes, such that homogeneity within a cluster is greater than between clusters. Clusters are identified by assessing the relative distances between points; in this example, the relative homogeneity of each cluster and the degree of separation between the clusters make the task very simple (Source: Multivariate Analysis for the Behavioral Sciences).

km <- kmeans(Boston, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

These pairwise plots show how well the variables discriminate between the 3 clusters.
For example, the column pairs tax and rad, tax and black, and tax and lstat were visually cluster-able. Interestingly, these columns also showed high correlations with each other.
As another example, the correlation between tax and rad is 0.91 (see below), so I plot the correlation matrix again. FYI: we could plot them in 3D to see this even better; for more info, check the bonus exercises.

cor(Boston$tax, Boston$rad)
## [1] 0.9102282
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", 
         cl.pos="b", tl.pos="d", tl.cex = 0.7)

# Investigate the optimal number of clusters
set.seed(1)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line', xlab= "number of clusters", ylab="Within groups sum of squares")

As the figure shows, the total within-cluster sum of squares declines sharply as the number of clusters K increases, but after 5 it stays almost steady, so about 4 or 5 clusters is enough.

# k-means clustering
km <- kmeans(Boston, centers = 2) # number of clusters
km
## K-means clustering with 2 clusters of sizes 369, 137
## 
## Cluster means:
##         crim       zn     indus       chas       nox       rm      age      dis
## 1  0.3887744 15.58266  8.420894 0.07317073 0.5118474 6.388005 60.63225 4.441272
## 2 12.2991617  0.00000 18.451825 0.05839416 0.6701022 6.006212 89.96788 2.054470
##         rad      tax  ptratio    black    lstat     medv
## 1  4.455285 311.9268 17.80921 381.0426 10.41745 24.85718
## 2 23.270073 667.6423 20.19635 291.0391 18.67453 16.27226
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   2   2   2 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   1   1   1   1   1   1   1 
## 501 502 503 504 505 506 
##   1   1   1   1   1   1 
## 
## Within cluster sum of squares by cluster:
## [1] 2868770 2896224
##  (between_SS / total_SS =  70.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

EXTRA coding 3: Run a loop to optimise the number of clusters

Here I try different numbers of clusters by changing the centers argument and report the between_SS / total_SS ratio for each.
Looking at the printed kmeans output above, this ratio is obtained by dividing betweenss by totss, so let’s print it inside the loop.
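This ratio can be verified on a small toy clustering (random data, arbitrary seed); the identity totss = betweenss + tot.withinss is what makes it interpretable as the share of variance explained:

```r
# Toy check: betweenss / totss is the share of total variance explained
set.seed(1)
d <- matrix(rnorm(40), ncol = 2)
km_toy <- kmeans(d, centers = 2)
ratio <- km_toy$betweenss / km_toy$totss * 100
all.equal(km_toy$totss, km_toy$betweenss + km_toy$tot.withinss)  # TRUE
```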

centers_list <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20)
for (i in centers_list){
  print('.........')
  print(i)
  km <- kmeans(Boston, centers = i)
  #print(km)
  print(km$betweenss / km$totss * 100)
}
## [1] "........."
## [1] 2
## [1] 70.28516
## [1] "........."
## [1] 3
## [1] 84.18386
## [1] "........."
## [1] 4
## [1] 90.64774
## [1] "........."
## [1] 5
## [1] 79.60841
## [1] "........."
## [1] 6
## [1] 93.57224
## [1] "........."
## [1] 7
## [1] 94.17725
## [1] "........."
## [1] 8
## [1] 82.01612
## [1] "........."
## [1] 9
## [1] 95.79905
## [1] "........."
## [1] 10
## [1] 82.72624
## [1] "........."
## [1] 15
## [1] 97.35675
## [1] "........."
## [1] 20
## [1] 97.86162

The printout above alternates between the number of clusters and the corresponding between_SS / total_SS percentage.
Interpretation: Although the percentage improves as the number of clusters increases, 4 clusters seems a good choice, because the value improves dramatically up to that point and then stays roughly steady. (The values are not strictly increasing because kmeans starts from random centres and no seed was fixed.)
Notably, I think this depends on the application and aim of each project. For example, in my personal experience with forest mapping from aerial imagery, increasing the number of clusters often led to horrible maps, so somewhere between 3 and 5 is good.
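Side note (a minimal sketch, assuming the MASS Boston data used throughout this section): fixing the seed and letting kmeans keep the best of several random starts (nstart) makes the comparison across cluster counts reproducible, so the ratio no longer jumps around between runs.

```r
library(MASS)   # Boston housing data

set.seed(123)   # fix the random starting centres
ratios <- sapply(2:10, function(k) {
  km <- kmeans(Boston, centers = k, nstart = 25)  # keep the best of 25 starts
  100 * km$betweenss / km$totss                   # % of variation explained
})
names(ratios) <- 2:10
round(ratios, 1)
```

With several starts per k, the explained-variation percentage increases smoothly with the number of clusters, which makes the elbow easier to read.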

Thus, as explained above, we will use 4 clusters to keep this exercise manageable and not overly complicated. I run the clustering again to save the result in an R object.

km <- kmeans(Boston, centers = 4)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

As the figure shows, and as explained above, some variable pairs, such as tax vs. black and tax vs. dist, separate the clusters quite clearly.

EXTRA coding 4: Another code to compute and plot the total within-cluster sum of square

Inspired by https://www.r-bloggers.com/2020/05/practical-guide-to-k-means-clustering/. This can be considered an alternative approach.

wssplot <- function(data, nc=15, seed=1234){
  # within-groups sum of squares for a single cluster
  wss <- (nrow(data)-1)*sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}

# plotting values for each cluster starting from 1 to 9
wssplot(Boston, nc = 9)

Interpretation: As the plot shows, the total within-cluster sum of squares declines as the number of clusters increases, but levels off after about 5. Thus, selecting 4 or 5 clusters seems optimal.

EXTRA coding 5: Another method to find optimal number of clusters for k-means

Inspired by: http://www.sthda.com/english/wiki/factoextra-r-package-easy-multivariate-data-analyses-and-elegant-visualization#cluster-analysis-and-factoextra

library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
my_data <- scale(Boston)
clus_plot = fviz_nbclust(my_data, kmeans, method = "wss") #method = "wss" is total within sum of square
clus_plot  

The fviz_nbclust function determines and visualises the optimal number of clusters using different methods; method = "wss" uses the total within-cluster sum of squares.
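The same helper supports other criteria as well; a sketch (assuming the factoextra package and the scaled Boston data as above) using the average silhouette width instead of the elbow heuristic:

```r
library(MASS)          # Boston data
library(factoextra)    # fviz_nbclust

my_data <- scale(Boston)
# method = "silhouette" picks the k that maximises the average
# silhouette width, a complement to the "wss" elbow heuristic
p <- fviz_nbclust(my_data, kmeans, method = "silhouette")
p
```

The silhouette criterion marks its suggested k with a dashed line, which can be compared against the elbow read off the wss plot.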

Bonus 2

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
library("plotly")
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library("colorspace")
#red <- hex(HLS(0, 0.5, 1))
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, 
        z = matrix_product$LD3, type= 'scatter3d', mode='markers', 
        color = train$crime, colors = c("blue", "yellow", "black", "red"))
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Please feel free to zoom in and out on the 3D figure :) As you can see, the 4 classes can be distinguished by zooming into the 3D figure above.

require(plot3D)
## Loading required package: plot3D
attach(mtcars)
## The following object is masked from package:ggplot2:
## 
##     mpg
scatter3D(x = matrix_product$LD1, y = matrix_product$LD2, 
          z = matrix_product$LD3, pch = 18, cex = 2, 
          theta = 15, phi = 20, # changing theta and phi changes the viewing angle of the 3D plot
          ticktype = "detailed",
          xlab = "LD1", ylab = "LD2", zlab = "LD3", clab = "", 
          colkey = list(length = 0.8, width = 0.4),            
          main = "3D scatter of clusters", col=c("blue", "yellow", "black", "red"))

Technical note: changing theta and phi changes the viewing angle of the 3D plot.

From this view, two clusters are clearly separable (one on the right side and one on the left); however, if we could rotate the plot, we could inspect the other sides and say more about the 4 possible clusters visually. I made that possible in the extra coding below.

EXTRA coding 6: Create an interactive 3D plot; interactively rotate and zoom the graph

Inspired by http://www.sthda.com/english/wiki/impressive-package-for-3d-and-4d-graph-r-software-and-data-visualization

# Make the rgl version
library("plot3Drgl")
## Loading required package: rgl
plotrgl()

Please feel free to zoom in and out on the 3D figure :)
Technical note: first run your 3D plot, then run this code to redraw it in an interactive, zoomable window.
We can see that it is quite easy to form 3-4 clusters visually.

FYI: You can even plot a histogram of all columns in 3D and zoom into it: hist3D(z=as.matrix(test))

Yet another way to visualize

# plot the lda results
plot(lda.fit, dimen = 3, col = classes, pch = classes)

So, we can see that it is very easy to form 2 clusters visually, and even 4 clusters seem possible.

EXTRA coding 7: Show the clusters in a graph

Inspired by https://www.r-bloggers.com/2020/05/practical-guide-to-k-means-clustering/

#Related to Bonus 2: coloring
library(factoextra)

fviz_cluster(km, data = Boston, main = "Partitioning Clustering Plot")

Technical note: fviz_cluster provides an elegant ggplot2-based visualisation of partitioning methods, including kmeans.
Since the graph is 2D, we cannot rotate or zoom it to see other views of the data, but overall it looks good!


options(knitr.duplicate.label = "allow")
#to solve the following knitting error, I added this chunk:
#error: Error in parse_block(g[-1], g[1], params.src, markdown_mode) :    Duplicate chunk label 'setup', which has been used for the chunk: knitr::opts_chunk$set(echo = TRUE) Calls: <Anonymous> ... process_file -> split_file -> lapply -> FUN -> parse_block
#inspired by: https://bookdown.org/yihui/rmarkdown-cookbook/duplicate-label.html


————————————————————————————

Exercise 5: Dimensionality reduction techniques


Mohammad Imangholiloo
My GitHub repository is https://github.com/Imangholiloo/IODS-project
Nowadays, it is common to have huge datasets (Big Data) in which multiple variables are highly correlated with each other, or the data contain noise; in other words, one variable can largely explain another. It is therefore good to reduce the high dimensionality of our data.
There are many available methods, but here we will learn about principal component analysis (PCA), which reduces any number of measured (continuous) and correlated variables into a few uncorrelated components that together capture as much of the original variance as possible. We will also use multiple correspondence analysis (MCA), which offers similar possibilities for discrete, even nominal-scale (categorical), variables by finding a suitable transformation into continuous scales and then reducing the dimensions quite analogously to PCA. More info: https://vimeo.com/204164956
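As a quick illustration of the PCA idea before we wrangle our own data (a sketch on the built-in iris measurements, used here purely as a stand-in dataset):

```r
# Standardise the variables, run PCA, and check how much of the total
# variance each principal component captures.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)$importance["Proportion of Variance", ]  # shares summing to 1
```

In this toy case the first component alone already captures most of the variance, which is exactly the kind of compression PCA offers.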

Part 1) Data wrangling

#Read the “Human development” file into R
hd <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human_development.csv", stringsAsFactors = F)

#Read the “Gender inequality” file into R
gii <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/gender_inequality.csv", stringsAsFactors = F, na.strings = "..")

For data description and more info please visit: http://hdr.undp.org/en/content/human-development-index-hdi and http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf

#structure of the data
str(hd)
## 'data.frame':    195 obs. of  8 variables:
##  $ HDI.Rank                              : int  1 2 3 4 5 6 6 8 9 9 ...
##  $ Country                               : chr  "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ Human.Development.Index..HDI.         : num  0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
##  $ Life.Expectancy.at.Birth              : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Expected.Years.of.Education           : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Mean.Years.of.Education               : num  12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
##  $ Gross.National.Income..GNI..per.Capita: chr  "64,992" "42,261" "56,431" "44,025" ...
##  $ GNI.per.Capita.Rank.Minus.HDI.Rank    : int  5 17 6 11 9 11 16 3 11 23 ...
#dimensions of the file (rows * columns)
dim(hd)
## [1] 195   8
#summary of file
summary(hd)
##     HDI.Rank        Country          Human.Development.Index..HDI.
##  Min.   :  1.00   Length:195         Min.   :0.3480               
##  1st Qu.: 47.75   Class :character   1st Qu.:0.5770               
##  Median : 94.00   Mode  :character   Median :0.7210               
##  Mean   : 94.31                      Mean   :0.6918               
##  3rd Qu.:141.25                      3rd Qu.:0.8000               
##  Max.   :188.00                      Max.   :0.9440               
##  NA's   :7                                                        
##  Life.Expectancy.at.Birth Expected.Years.of.Education Mean.Years.of.Education
##  Min.   :49.00            Min.   : 4.10               Min.   : 1.400         
##  1st Qu.:65.75            1st Qu.:11.10               1st Qu.: 5.550         
##  Median :73.10            Median :13.10               Median : 8.400         
##  Mean   :71.07            Mean   :12.86               Mean   : 8.079         
##  3rd Qu.:76.80            3rd Qu.:14.90               3rd Qu.:10.600         
##  Max.   :84.00            Max.   :20.20               Max.   :13.100         
##                                                                              
##  Gross.National.Income..GNI..per.Capita GNI.per.Capita.Rank.Minus.HDI.Rank
##  Length:195                             Min.   :-84.0000                  
##  Class :character                       1st Qu.: -9.0000                  
##  Mode  :character                       Median :  1.5000                  
##                                         Mean   :  0.1862                  
##                                         3rd Qu.: 11.0000                  
##                                         Max.   : 47.0000                  
##                                         NA's   :7

As you can see, our columns are all numerical except two: Country and the GNI per capita column, which are character. The GNI values were read as character because they contain comma thousands separators (e.g. "64,992").
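A sketch of the usual fix (standard base R; the sample values are copied from the head(hd) printout above):

```r
# The GNI column was read as character because of the comma thousands
# separators; stripping the commas converts it cleanly to numeric.
gni_chr <- c("64,992", "42,261", "56,431")
gni_num <- as.numeric(gsub(",", "", gni_chr))
gni_num  # 64992 42261 56431
```

On the real data, the same gsub + as.numeric call can be applied to the GNI column before any numerical analysis.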

#column names
colnames(hd)
## [1] "HDI.Rank"                              
## [2] "Country"                               
## [3] "Human.Development.Index..HDI."         
## [4] "Life.Expectancy.at.Birth"              
## [5] "Expected.Years.of.Education"           
## [6] "Mean.Years.of.Education"               
## [7] "Gross.National.Income..GNI..per.Capita"
## [8] "GNI.per.Capita.Rank.Minus.HDI.Rank"

As you can see, the column names are long, so in the next steps we will shorten them.

#see head and tail of the data
head(hd)
##   HDI.Rank     Country Human.Development.Index..HDI. Life.Expectancy.at.Birth
## 1        1      Norway                         0.944                     81.6
## 2        2   Australia                         0.935                     82.4
## 3        3 Switzerland                         0.930                     83.0
## 4        4     Denmark                         0.923                     80.2
## 5        5 Netherlands                         0.922                     81.6
## 6        6     Germany                         0.916                     80.9
##   Expected.Years.of.Education Mean.Years.of.Education
## 1                        17.5                    12.6
## 2                        20.2                    13.0
## 3                        15.8                    12.8
## 4                        18.7                    12.7
## 5                        17.9                    11.9
## 6                        16.5                    13.1
##   Gross.National.Income..GNI..per.Capita GNI.per.Capita.Rank.Minus.HDI.Rank
## 1                                 64,992                                  5
## 2                                 42,261                                 17
## 3                                 56,431                                  6
## 4                                 44,025                                 11
## 5                                 45,435                                  9
## 6                                 43,919                                 11
tail(hd)
##     HDI.Rank                         Country Human.Development.Index..HDI.
## 190       NA       East Asia and the Pacific                         0.710
## 191       NA         Europe and Central Asia                         0.748
## 192       NA Latin America and the Caribbean                         0.748
## 193       NA                      South Asia                         0.607
## 194       NA              Sub-Saharan Africa                         0.518
## 195       NA                           World                         0.711
##     Life.Expectancy.at.Birth Expected.Years.of.Education
## 190                     74.0                        12.7
## 191                     72.3                        13.6
## 192                     75.0                        14.0
## 193                     68.4                        11.2
## 194                     58.5                         9.6
## 195                     71.5                        12.2
##     Mean.Years.of.Education Gross.National.Income..GNI..per.Capita
## 190                     7.5                                 11,449
## 191                    10.0                                 12,791
## 192                     8.2                                 14,242
## 193                     5.5                                  5,605
## 194                     5.2                                  3,363
## 195                     7.9                                 14,301
##     GNI.per.Capita.Rank.Minus.HDI.Rank
## 190                                 NA
## 191                                 NA
## 192                                 NA
## 193                                 NA
## 194                                 NA
## 195                                 NA
#Let's do the same for the gii dataset
#structure of the data
str(gii)
## 'data.frame':    195 obs. of  10 variables:
##  $ GII.Rank                                    : int  1 2 3 4 5 6 6 8 9 9 ...
##  $ Country                                     : chr  "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ Gender.Inequality.Index..GII.               : num  0.067 0.11 0.028 0.048 0.062 0.041 0.113 0.28 0.129 0.157 ...
##  $ Maternal.Mortality.Ratio                    : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Adolescent.Birth.Rate                       : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Percent.Representation.in.Parliament        : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  $ Population.with.Secondary.Education..Female.: num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ Population.with.Secondary.Education..Male.  : num  96.7 94.6 96.6 96.6 90.5 97 78.6 94.8 100 95.3 ...
##  $ Labour.Force.Participation.Rate..Female.    : num  61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
##  $ Labour.Force.Participation.Rate..Male.      : num  68.7 71.8 74.9 66.4 70.6 66.4 68.1 68.9 71 73.8 ...
#dimensions of the file (rows * columns)
dim(gii)
## [1] 195  10
#summary of file
summary(gii)
##     GII.Rank        Country          Gender.Inequality.Index..GII.
##  Min.   :  1.00   Length:195         Min.   :0.0160               
##  1st Qu.: 47.75   Class :character   1st Qu.:0.2030               
##  Median : 94.00   Mode  :character   Median :0.3935               
##  Mean   : 94.31                      Mean   :0.3695               
##  3rd Qu.:141.25                      3rd Qu.:0.5272               
##  Max.   :188.00                      Max.   :0.7440               
##  NA's   :7                           NA's   :33                   
##  Maternal.Mortality.Ratio Adolescent.Birth.Rate
##  Min.   :   1.0           Min.   :  0.60       
##  1st Qu.:  16.0           1st Qu.: 15.45       
##  Median :  69.0           Median : 40.95       
##  Mean   : 163.2           Mean   : 49.55       
##  3rd Qu.: 230.0           3rd Qu.: 71.78       
##  Max.   :1100.0           Max.   :204.80       
##  NA's   :10               NA's   :5            
##  Percent.Representation.in.Parliament
##  Min.   : 0.00                       
##  1st Qu.:12.47                       
##  Median :19.50                       
##  Mean   :20.60                       
##  3rd Qu.:27.02                       
##  Max.   :57.50                       
##  NA's   :3                           
##  Population.with.Secondary.Education..Female.
##  Min.   :  0.9                               
##  1st Qu.: 27.8                               
##  Median : 55.7                               
##  Mean   : 54.8                               
##  3rd Qu.: 81.8                               
##  Max.   :100.0                               
##  NA's   :26                                  
##  Population.with.Secondary.Education..Male.
##  Min.   :  3.20                            
##  1st Qu.: 38.30                            
##  Median : 60.00                            
##  Mean   : 60.29                            
##  3rd Qu.: 85.80                            
##  Max.   :100.00                            
##  NA's   :26                                
##  Labour.Force.Participation.Rate..Female.
##  Min.   :13.50                           
##  1st Qu.:44.50                           
##  Median :53.30                           
##  Mean   :52.61                           
##  3rd Qu.:62.62                           
##  Max.   :88.10                           
##  NA's   :11                              
##  Labour.Force.Participation.Rate..Male.
##  Min.   :44.20                         
##  1st Qu.:68.88                         
##  Median :75.55                         
##  Mean   :74.74                         
##  3rd Qu.:80.15                         
##  Max.   :95.50                         
##  NA's   :11

As the summary shows, all variables except Country are numerical.
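The NA's rows of the summary can also be collected into one vector per column; a small sketch on a toy data frame (on the real data the call would simply be colSums(is.na(gii))):

```r
# Count missing values per column; a compact complement to summary()
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
colSums(is.na(toy))  # a: 1, b: 2
```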

#column names
colnames(gii)
##  [1] "GII.Rank"                                    
##  [2] "Country"                                     
##  [3] "Gender.Inequality.Index..GII."               
##  [4] "Maternal.Mortality.Ratio"                    
##  [5] "Adolescent.Birth.Rate"                       
##  [6] "Percent.Representation.in.Parliament"        
##  [7] "Population.with.Secondary.Education..Female."
##  [8] "Population.with.Secondary.Education..Male."  
##  [9] "Labour.Force.Participation.Rate..Female."    
## [10] "Labour.Force.Participation.Rate..Male."
#see head and tail of the data
head(gii)
##   GII.Rank     Country Gender.Inequality.Index..GII. Maternal.Mortality.Ratio
## 1        1      Norway                         0.067                        4
## 2        2   Australia                         0.110                        6
## 3        3 Switzerland                         0.028                        6
## 4        4     Denmark                         0.048                        5
## 5        5 Netherlands                         0.062                        6
## 6        6     Germany                         0.041                        7
##   Adolescent.Birth.Rate Percent.Representation.in.Parliament
## 1                   7.8                                 39.6
## 2                  12.1                                 30.5
## 3                   1.9                                 28.5
## 4                   5.1                                 38.0
## 5                   6.2                                 36.9
## 6                   3.8                                 36.9
##   Population.with.Secondary.Education..Female.
## 1                                         97.4
## 2                                         94.3
## 3                                         95.0
## 4                                         95.5
## 5                                         87.7
## 6                                         96.3
##   Population.with.Secondary.Education..Male.
## 1                                       96.7
## 2                                       94.6
## 3                                       96.6
## 4                                       96.6
## 5                                       90.5
## 6                                       97.0
##   Labour.Force.Participation.Rate..Female.
## 1                                     61.2
## 2                                     58.8
## 3                                     61.8
## 4                                     58.7
## 5                                     58.5
## 6                                     53.6
##   Labour.Force.Participation.Rate..Male.
## 1                                   68.7
## 2                                   71.8
## 3                                   74.9
## 4                                   66.4
## 5                                   70.6
## 6                                   66.4
tail(gii)
##     GII.Rank                         Country Gender.Inequality.Index..GII.
## 190       NA       East Asia and the Pacific                         0.328
## 191       NA         Europe and Central Asia                         0.300
## 192       NA Latin America and the Caribbean                         0.415
## 193       NA                      South Asia                         0.536
## 194       NA              Sub-Saharan Africa                         0.575
## 195       NA                           World                         0.449
##     Maternal.Mortality.Ratio Adolescent.Birth.Rate
## 190                       72                  21.2
## 191                       28                  30.8
## 192                       85                  68.3
## 193                      183                  38.7
## 194                      506                 109.7
## 195                      210                  47.4
##     Percent.Representation.in.Parliament
## 190                                 18.7
## 191                                 19.0
## 192                                 27.0
## 193                                 17.5
## 194                                 22.5
## 195                                 21.8
##     Population.with.Secondary.Education..Female.
## 190                                         54.7
## 191                                         70.8
## 192                                         54.3
## 193                                         29.1
## 194                                         22.1
## 195                                         54.5
##     Population.with.Secondary.Education..Male.
## 190                                       66.3
## 191                                       80.6
## 192                                       55.2
## 193                                       54.6
## 194                                       31.5
## 195                                       65.4
##     Labour.Force.Participation.Rate..Female.
## 190                                     62.6
## 191                                     45.6
## 192                                     53.7
## 193                                     29.8
## 194                                     65.4
## 195                                     50.3
##     Labour.Force.Participation.Rate..Male.
## 190                                   79.4
## 191                                   70.0
## 192                                   79.8
## 193                                   80.3
## 194                                   76.6
## 195                                   76.7

Task 4: Rename the variables with (shorter) descriptive names

colnames(hd)
## [1] "HDI.Rank"                              
## [2] "Country"                               
## [3] "Human.Development.Index..HDI."         
## [4] "Life.Expectancy.at.Birth"              
## [5] "Expected.Years.of.Education"           
## [6] "Mean.Years.of.Education"               
## [7] "Gross.National.Income..GNI..per.Capita"
## [8] "GNI.per.Capita.Rank.Minus.HDI.Rank"
colnames(hd) <- c('HDI_Rank', 'Country', 'HDI', 'Life_Expec_Birth', 
                  'Expec_yr_Edu', 'Mean_yr_Edu', 'GNI_per_Cap', 
                  'GNI_per_Cap_Min_HDI_Rank')
colnames(hd)
## [1] "HDI_Rank"                 "Country"                 
## [3] "HDI"                      "Life_Expec_Birth"        
## [5] "Expec_yr_Edu"             "Mean_yr_Edu"             
## [7] "GNI_per_Cap"              "GNI_per_Cap_Min_HDI_Rank"

So you can see that the column names were shortened.

Let’s do the same for the gii dataset

colnames(gii)
##  [1] "GII.Rank"                                    
##  [2] "Country"                                     
##  [3] "Gender.Inequality.Index..GII."               
##  [4] "Maternal.Mortality.Ratio"                    
##  [5] "Adolescent.Birth.Rate"                       
##  [6] "Percent.Representation.in.Parliament"        
##  [7] "Population.with.Secondary.Education..Female."
##  [8] "Population.with.Secondary.Education..Male."  
##  [9] "Labour.Force.Participation.Rate..Female."    
## [10] "Labour.Force.Participation.Rate..Male."
colnames(gii) <- c('GII_Rank', 'Country', 'GII', 'Mortality_R', 
                  'Adolescent_Birth_R', 'Representation_Parliament',
                  'Sec_Edu_Fem','Sec_Edu_Mal',
                  "Lab_Forc_Particip_R_Fem",
                  "Lab_Forc_Particip_R_Mal" )
colnames(gii)
##  [1] "GII_Rank"                  "Country"                  
##  [3] "GII"                       "Mortality_R"              
##  [5] "Adolescent_Birth_R"        "Representation_Parliament"
##  [7] "Sec_Edu_Fem"               "Sec_Edu_Mal"              
##  [9] "Lab_Forc_Particip_R_Fem"   "Lab_Forc_Particip_R_Mal"

So you can see that the column names were shortened.
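An alternative to replacing colnames() wholesale (a sketch, assuming dplyr, which this exercise uses anyway): dplyr::rename changes only the named columns, so it does not depend on column order.

```r
library(dplyr)

toy <- data.frame(GII.Rank = 1:2, Country = c("Norway", "Australia"))
toy <- rename(toy, GII_Rank = GII.Rank)  # syntax: new_name = old_name
colnames(toy)  # "GII_Rank" "Country"
```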

Task 5: Mutate the “Gender inequality” data and create two new variables

library(dplyr)
gii <- mutate(gii, edu_R = (Sec_Edu_Fem/Sec_Edu_Mal))
head(gii)
##   GII_Rank     Country   GII Mortality_R Adolescent_Birth_R
## 1        1      Norway 0.067           4                7.8
## 2        2   Australia 0.110           6               12.1
## 3        3 Switzerland 0.028           6                1.9
## 4        4     Denmark 0.048           5                5.1
## 5        5 Netherlands 0.062           6                6.2
## 6        6     Germany 0.041           7                3.8
##   Representation_Parliament Sec_Edu_Fem Sec_Edu_Mal Lab_Forc_Particip_R_Fem
## 1                      39.6        97.4        96.7                    61.2
## 2                      30.5        94.3        94.6                    58.8
## 3                      28.5        95.0        96.6                    61.8
## 4                      38.0        95.5        96.6                    58.7
## 5                      36.9        87.7        90.5                    58.5
## 6                      36.9        96.3        97.0                    53.6
##   Lab_Forc_Particip_R_Mal     edu_R
## 1                    68.7 1.0072389
## 2                    71.8 0.9968288
## 3                    74.9 0.9834369
## 4                    66.4 0.9886128
## 5                    70.6 0.9690608
## 6                    66.4 0.9927835

Extra coding 1: an alternative to the mutate function

Instead of using mutate, this is also possible:

gii$edu_R <- gii$Sec_Edu_Fem/gii$Sec_Edu_Mal

gii <- mutate(gii, Lab_Forc_R = (Lab_Forc_Particip_R_Fem/Lab_Forc_Particip_R_Mal))
head(gii)
##   GII_Rank     Country   GII Mortality_R Adolescent_Birth_R
## 1        1      Norway 0.067           4                7.8
## 2        2   Australia 0.110           6               12.1
## 3        3 Switzerland 0.028           6                1.9
## 4        4     Denmark 0.048           5                5.1
## 5        5 Netherlands 0.062           6                6.2
## 6        6     Germany 0.041           7                3.8
##   Representation_Parliament Sec_Edu_Fem Sec_Edu_Mal Lab_Forc_Particip_R_Fem
## 1                      39.6        97.4        96.7                    61.2
## 2                      30.5        94.3        94.6                    58.8
## 3                      28.5        95.0        96.6                    61.8
## 4                      38.0        95.5        96.6                    58.7
## 5                      36.9        87.7        90.5                    58.5
## 6                      36.9        96.3        97.0                    53.6
##   Lab_Forc_Particip_R_Mal     edu_R Lab_Forc_R
## 1                    68.7 1.0072389  0.8908297
## 2                    71.8 0.9968288  0.8189415
## 3                    74.9 0.9834369  0.8251001
## 4                    66.4 0.9886128  0.8840361
## 5                    70.6 0.9690608  0.8286119
## 6                    66.4 0.9927835  0.8072289

Task 6: Join together the two datasets using the variable Country as the identifier
TIP: we use an inner join to keep only the observations present in both data sets.
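Before the join, dplyr's anti_join can show which rows an inner join would drop; a toy sketch with a hypothetical unmatched country:

```r
library(dplyr)

hd_toy  <- data.frame(Country = c("Norway", "Atlantis"), HDI = c(0.944, 0.500))
gii_toy <- data.frame(Country = "Norway", GII = 0.067)
# rows of hd_toy whose Country has no match in gii_toy
anti_join(hd_toy, gii_toy, by = "Country")
```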

hd_gii_joint <- inner_join(hd, gii, 
                           by = 'Country', suffix = c("_hd", "_gi"))

head(hd_gii_joint)
##   HDI_Rank     Country   HDI Life_Expec_Birth Expec_yr_Edu Mean_yr_Edu
## 1        1      Norway 0.944             81.6         17.5        12.6
## 2        2   Australia 0.935             82.4         20.2        13.0
## 3        3 Switzerland 0.930             83.0         15.8        12.8
## 4        4     Denmark 0.923             80.2         18.7        12.7
## 5        5 Netherlands 0.922             81.6         17.9        11.9
## 6        6     Germany 0.916             80.9         16.5        13.1
##   GNI_per_Cap GNI_per_Cap_Min_HDI_Rank GII_Rank   GII Mortality_R
## 1      64,992                        5        1 0.067           4
## 2      42,261                       17        2 0.110           6
## 3      56,431                        6        3 0.028           6
## 4      44,025                       11        4 0.048           5
## 5      45,435                        9        5 0.062           6
## 6      43,919                       11        6 0.041           7
##   Adolescent_Birth_R Representation_Parliament Sec_Edu_Fem Sec_Edu_Mal
## 1                7.8                      39.6        97.4        96.7
## 2               12.1                      30.5        94.3        94.6
## 3                1.9                      28.5        95.0        96.6
## 4                5.1                      38.0        95.5        96.6
## 5                6.2                      36.9        87.7        90.5
## 6                3.8                      36.9        96.3        97.0
##   Lab_Forc_Particip_R_Fem Lab_Forc_Particip_R_Mal     edu_R Lab_Forc_R
## 1                    61.2                    68.7 1.0072389  0.8908297
## 2                    58.8                    71.8 0.9968288  0.8189415
## 3                    61.8                    74.9 0.9834369  0.8251001
## 4                    58.7                    66.4 0.9886128  0.8840361
## 5                    58.5                    70.6 0.9690608  0.8286119
## 6                    53.6                    66.4 0.9927835  0.8072289
dim(hd_gii_joint)
## [1] 195  19
str(hd_gii_joint)
## 'data.frame':    195 obs. of  19 variables:
##  $ HDI_Rank                 : int  1 2 3 4 5 6 6 8 9 9 ...
##  $ Country                  : chr  "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ HDI                      : num  0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
##  $ Life_Expec_Birth         : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Expec_yr_Edu             : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Mean_yr_Edu              : num  12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
##  $ GNI_per_Cap              : chr  "64,992" "42,261" "56,431" "44,025" ...
##  $ GNI_per_Cap_Min_HDI_Rank : int  5 17 6 11 9 11 16 3 11 23 ...
##  $ GII_Rank                 : int  1 2 3 4 5 6 6 8 9 9 ...
##  $ GII                      : num  0.067 0.11 0.028 0.048 0.062 0.041 0.113 0.28 0.129 0.157 ...
##  $ Mortality_R              : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Adolescent_Birth_R       : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Representation_Parliament: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  $ Sec_Edu_Fem              : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ Sec_Edu_Mal              : num  96.7 94.6 96.6 96.6 90.5 97 78.6 94.8 100 95.3 ...
##  $ Lab_Forc_Particip_R_Fem  : num  61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
##  $ Lab_Forc_Particip_R_Mal  : num  68.7 71.8 74.9 66.4 70.6 66.4 68.1 68.9 71 73.8 ...
##  $ edu_R                    : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Lab_Forc_R               : num  0.891 0.819 0.825 0.884 0.829 ...
#setwd("./data")
write.csv(hd_gii_joint, "./data/human.csv")

Data wrangling (for this week)

Now, this week, let's continue with data wrangling and analysis under the theme of dimensionality reduction.

human <- read.csv("./data/human.csv")
# alternatively, you can read it directly from the web:
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human1.txt", sep  =",", header = T)

# check column names of the data
names(human)
##  [1] "HDI.Rank"       "Country"        "HDI"            "Life.Exp"      
##  [5] "Edu.Exp"        "Edu.Mean"       "GNI"            "GNI.Minus.Rank"
##  [9] "GII.Rank"       "GII"            "Mat.Mor"        "Ado.Birth"     
## [13] "Parli.F"        "Edu2.F"         "Edu2.M"         "Labo.F"        
## [17] "Labo.M"         "Edu2.FM"        "Labo.FM"
# dimensions of the dataset
dim(human)
## [1] 195  19

Data description:
This data originates from the United Nations Development Programme. It describes the human development status of different countries through a number of indicators.
The data combines several indicators from most countries in the world. More info at http://hdr.undp.org/en/content/human-development-index-hdi

The variable names and their descriptions are as follows: “Country” = Country name

Health and knowledge
“GNI” = Gross National Income per capita
“Life.Exp” = Life expectancy at birth
“Edu.Exp” = Expected years of schooling
“Mat.Mor” = Maternal mortality ratio
“Ado.Birth” = Adolescent birth rate

Empowerment
“Parli.F” = Percentage of female representatives in parliament
“Edu2.F” = Proportion of females with at least secondary education
“Edu2.M” = Proportion of males with at least secondary education
“Labo.F” = Proportion of females in the labour force
“Labo.M” = Proportion of males in the labour force

“Edu2.FM” = Edu2.F / Edu2.M
“Labo.FM” = Labo.F / Labo.M

NOTE: Since I personally prefer not to have dots (.) in file or column names, I will change them to underscores (_).

colnames(human) <- c("HDI_Rank", "Country", "HDI", "Life_Exp", 
                   "Edu_Exp", "Edu_Mean", "GNI", "GNI_Minus_Rank",
                   "GII_Rank","GII","Mat_Mor","Ado_Birth",
                   "Parli_F","Edu2_F","Edu2_M","Labo_F",
                   "Labo_M","Edu2_FM","Labo_FM")
#Source: https://raw.githubusercontent.com/TuomoNieminen/Helsinki-Open-Data-Science/master/datasets/human_meta.txt  
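A more compact alternative, assuming the only change needed is swapping dots for underscores, is to transform the existing names instead of retyping all nineteen of them (shown here on a small sample of the original names):

```r
# gsub() with fixed = TRUE treats "." as a literal dot, not a regex.
old_names <- c("HDI.Rank", "Edu2.FM", "Labo.FM")
gsub(".", "_", old_names, fixed = TRUE)
# in the actual data: colnames(human) <- gsub(".", "_", colnames(human), fixed = TRUE)
```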

# Check Column names again that changed successfully
colnames(human)
##  [1] "HDI_Rank"       "Country"        "HDI"            "Life_Exp"      
##  [5] "Edu_Exp"        "Edu_Mean"       "GNI"            "GNI_Minus_Rank"
##  [9] "GII_Rank"       "GII"            "Mat_Mor"        "Ado_Birth"     
## [13] "Parli_F"        "Edu2_F"         "Edu2_M"         "Labo_F"        
## [17] "Labo_M"         "Edu2_FM"        "Labo_FM"
# look at the structure of human
str(human)
## 'data.frame':    195 obs. of  19 variables:
##  $ HDI_Rank      : int  1 2 3 4 5 6 6 8 9 9 ...
##  $ Country       : chr  "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ HDI           : num  0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
##  $ Life_Exp      : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu_Exp       : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Edu_Mean      : num  12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
##  $ GNI           : chr  "64,992" "42,261" "56,431" "44,025" ...
##  $ GNI_Minus_Rank: int  5 17 6 11 9 11 16 3 11 23 ...
##  $ GII_Rank      : int  1 2 3 4 5 6 6 8 9 9 ...
##  $ GII           : num  0.067 0.11 0.028 0.048 0.062 0.041 0.113 0.28 0.129 0.157 ...
##  $ Mat_Mor       : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado_Birth     : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli_F       : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  $ Edu2_F        : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ Edu2_M        : num  96.7 94.6 96.6 96.6 90.5 97 78.6 94.8 100 95.3 ...
##  $ Labo_F        : num  61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
##  $ Labo_M        : num  68.7 71.8 74.9 66.4 70.6 66.4 68.1 68.9 71 73.8 ...
##  $ Edu2_FM       : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo_FM       : num  0.891 0.819 0.825 0.884 0.829 ...
# print out summaries of the variables
summary(human)
##     HDI_Rank        Country               HDI            Life_Exp    
##  Min.   :  1.00   Length:195         Min.   :0.3480   Min.   :49.00  
##  1st Qu.: 47.75   Class :character   1st Qu.:0.5770   1st Qu.:65.75  
##  Median : 94.00   Mode  :character   Median :0.7210   Median :73.10  
##  Mean   : 94.31                      Mean   :0.6918   Mean   :71.07  
##  3rd Qu.:141.25                      3rd Qu.:0.8000   3rd Qu.:76.80  
##  Max.   :188.00                      Max.   :0.9440   Max.   :84.00  
##  NA's   :7                                                           
##     Edu_Exp         Edu_Mean          GNI            GNI_Minus_Rank    
##  Min.   : 4.10   Min.   : 1.400   Length:195         Min.   :-84.0000  
##  1st Qu.:11.10   1st Qu.: 5.550   Class :character   1st Qu.: -9.0000  
##  Median :13.10   Median : 8.400   Mode  :character   Median :  1.5000  
##  Mean   :12.86   Mean   : 8.079                      Mean   :  0.1862  
##  3rd Qu.:14.90   3rd Qu.:10.600                      3rd Qu.: 11.0000  
##  Max.   :20.20   Max.   :13.100                      Max.   : 47.0000  
##                                                      NA's   :7         
##     GII_Rank           GII            Mat_Mor         Ado_Birth     
##  Min.   :  1.00   Min.   :0.0160   Min.   :   1.0   Min.   :  0.60  
##  1st Qu.: 47.75   1st Qu.:0.2030   1st Qu.:  16.0   1st Qu.: 15.45  
##  Median : 94.00   Median :0.3935   Median :  69.0   Median : 40.95  
##  Mean   : 94.31   Mean   :0.3695   Mean   : 163.2   Mean   : 49.55  
##  3rd Qu.:141.25   3rd Qu.:0.5272   3rd Qu.: 230.0   3rd Qu.: 71.78  
##  Max.   :188.00   Max.   :0.7440   Max.   :1100.0   Max.   :204.80  
##  NA's   :7        NA's   :33       NA's   :10       NA's   :5       
##     Parli_F          Edu2_F          Edu2_M           Labo_F     
##  Min.   : 0.00   Min.   :  0.9   Min.   :  3.20   Min.   :13.50  
##  1st Qu.:12.47   1st Qu.: 27.8   1st Qu.: 38.30   1st Qu.:44.50  
##  Median :19.50   Median : 55.7   Median : 60.00   Median :53.30  
##  Mean   :20.60   Mean   : 54.8   Mean   : 60.29   Mean   :52.61  
##  3rd Qu.:27.02   3rd Qu.: 81.8   3rd Qu.: 85.80   3rd Qu.:62.62  
##  Max.   :57.50   Max.   :100.0   Max.   :100.00   Max.   :88.10  
##  NA's   :3       NA's   :26      NA's   :26       NA's   :11     
##      Labo_M         Edu2_FM          Labo_FM      
##  Min.   :44.20   Min.   :0.1717   Min.   :0.1857  
##  1st Qu.:68.88   1st Qu.:0.7284   1st Qu.:0.5987  
##  Median :75.55   Median :0.9349   Median :0.7514  
##  Mean   :74.74   Mean   :0.8541   Mean   :0.7038  
##  3rd Qu.:80.15   3rd Qu.:0.9968   3rd Qu.:0.8513  
##  Max.   :95.50   Max.   :1.4967   Max.   :1.0380  
##  NA's   :11      NA's   :26       NA's   :11

The summary shows the minimum, maximum, mean, median, and 1st and 3rd quartiles of each variable. It also shows whether there are NA cells in the columns. Moreover, we can see that all variables except Country and GNI are numeric.

Task 1: Mutate the data: transform the Gross National Income (GNI) variable to numeric

# access the stringr package
library(stringr)

# look at the structure of the GNI column in 'human'
str(human$GNI)
##  chr [1:195] "64,992" "42,261" "56,431" "44,025" "45,435" "43,919" "39,568" ...
class(human$GNI)
## [1] "character"

As you can see, the GNI column is not numeric (it is of class character), so we shall convert it to numeric.

# remove the commas from GNI and print out a numeric version of it
human$GNI <- str_replace(human$GNI, pattern=",", replace ="") %>% as.numeric

str(human$GNI)
##  num [1:195] 64992 42261 56431 44025 45435 ...
class(human$GNI)
## [1] "numeric"

Task 2: Keep only the columns matching the following variable names

# columns to keep
keep <- c("Country", "Edu2_FM", "Labo_FM", "Life_Exp", "Edu_Exp", 
          "GNI", "Mat_Mor", "Ado_Birth", "Parli_F")

# select the 'keep' columns
human <- dplyr::select(human, one_of(keep))

length(keep)
## [1] 9
ncol(human)
## [1] 9

So the number of columns in the dataset now equals the number of variables we want to keep.
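A defensive check can make this claim explicit; a minimal sketch with a toy frame and a shortened `keep` vector standing in for the real ones:

```r
keep <- c("Country", "GNI")                        # hypothetical 'keep' vector
df <- data.frame(Country = "Norway", GNI = 64992)  # toy stand-in for 'human'
stopifnot(ncol(df) == length(keep), all(colnames(df) == keep))  # silent if OK
```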

Task 3 : Remove all rows with missing values

# print out a completeness indicator of the 'human' data
complete.cases(human)
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [13] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
##  [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
##  [61]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [97]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [109] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [121]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [133] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [145]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## [157] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [169] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [181]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [193]  TRUE  TRUE  TRUE

The complete.cases function returns a logical vector indicating which cases are complete, i.e., have no missing values.
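Before dropping rows, it can be useful to see where the missing values sit. A minimal sketch on a toy data frame (substitute human in the actual analysis):

```r
df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))
colSums(is.na(df))        # NAs per column: x 1, y 2
sum(!complete.cases(df))  # rows complete-case filtering would drop: 2
```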

# print out the data along with a completeness indicator as the last column
data.frame(human[-1], comp = complete.cases(human))
##       Edu2_FM   Labo_FM Life_Exp Edu_Exp    GNI Mat_Mor Ado_Birth Parli_F  comp
## 1   1.0072389 0.8908297     81.6    17.5  64992       4       7.8    39.6  TRUE
## 2   0.9968288 0.8189415     82.4    20.2  42261       6      12.1    30.5  TRUE
## 3   0.9834369 0.8251001     83.0    15.8  56431       6       1.9    28.5  TRUE
## 4   0.9886128 0.8840361     80.2    18.7  44025       5       5.1    38.0  TRUE
## 5   0.9690608 0.8286119     81.6    17.9  45435       6       6.2    36.9  TRUE
## 6   0.9927835 0.8072289     80.9    16.5  43919       7       3.8    36.9  TRUE
## 7   1.0241730 0.7797357     80.9    18.6  39568       9       8.2    19.9  TRUE
## 8   1.0031646 0.8171263     79.1    16.5  52947      28      31.0    19.4  TRUE
## 9   1.0000000 0.8676056     82.0    15.9  42155      11      14.5    28.2  TRUE
## 10  0.9968520 0.8401084     81.8    19.2  32689       8      25.3    31.4  TRUE
## 11  0.9148148 0.7616580     83.0    15.4  76628       6       6.0    25.3  TRUE
## 12  0.9116162 0.7566372     84.0    15.6  53959      NA       3.3      NA FALSE
## 13         NA        NA     80.0    15.0  79851      NA        NA    20.0 FALSE
## 14  0.9908362 0.8880707     82.2    15.8  45636       4       6.5    43.6  TRUE
## 15  0.9989990 0.8107715     80.7    16.2  39267       8      25.8    23.5  TRUE
## 16  0.9934498 0.9108527     82.6    19.0  35182       4      11.5    41.3  TRUE
## 17  0.8641975 0.6948682     81.9    16.9  33890      27       2.2    16.3  TRUE
## 18  0.9667812 0.8379161     82.4    16.0  30676       2       7.8    22.5  TRUE
## 19  1.0000000 0.7848297     81.7    13.9  58711      11       8.3    28.3  TRUE
## 20  1.0139860 0.6931818     83.5    15.3  36927       6       5.4    11.6  TRUE
## 21  0.9348613 0.8010118     80.8    16.3  41187       6       6.7    42.4  TRUE
## 22  0.9375000 0.8230519     82.2    16.0  38056      12       5.7    25.7  TRUE
## 23  1.0000000 0.8064993     81.4    15.7  43869       4       4.1    30.3  TRUE
## 24  1.0000000 0.8703125     80.8    17.1  38695       4       9.2    42.5  TRUE
## 25  0.9775510 0.8275316     80.4    16.8  27852       7       0.6    27.7  TRUE
## 26  0.9138167 0.7978723     82.6    17.3  32045       4      10.6    38.0  TRUE
## 27  0.8844720 0.6655462     83.1    16.0  33030       4       4.0    30.1  TRUE
## 28  1.0020060 0.7481698     78.6    16.4  26660       5       4.9    18.9  TRUE
## 29  0.8880597 0.7072000     80.9    17.6  24524       5      11.9    21.0  TRUE
## 30  1.0000000 0.8156749     76.8    16.5  25214      11      16.8    19.8  TRUE
## 31  0.9424779 0.6985392     78.8    14.5  72570      27      23.0      NA FALSE
## 32  0.9302326 0.7876231     80.2    14.0  28633      10       5.5    12.5  TRUE
## 33  1.1305085 0.5319372     78.2    13.8 123124       6       9.5     0.0  TRUE
## 34  1.0040568        NA     81.3    13.5  43978      NA        NA    50.0 FALSE
## 35  0.9959799 0.7448980     76.3    15.1  25845       7      15.9    18.7  TRUE
## 36  0.9286550 0.7534669     77.4    15.5  23177       3      12.2    22.1  TRUE
## 37  0.9448568 0.8291233     73.3    16.4  24500      11      10.6    23.4  TRUE
## 38  0.8772379 0.5716440     80.6    14.4  27930       9      18.2    13.0  TRUE
## 39  0.8605974 0.2579821     74.3    16.3  52821      16      10.2    19.9  TRUE
## 40  0.9774306 0.6333333     76.3    17.9  22050      69      54.4    36.8  TRUE
## 41  1.1944444 0.5054348     77.0    13.3  60868       8      27.6    17.5  TRUE
## 42  0.9594241 0.6577540     81.7    15.2  21290      22      55.3    15.8  TRUE
## 43  0.9896266 0.8293051     80.9    16.3  25757       8      12.6    31.3  TRUE
## 44  0.9918946 0.7466667     75.2    15.4  22916      14      12.1    10.1  TRUE
## 45  1.1031128 0.4510932     76.6    14.4  38599      22      13.8    15.0  TRUE
## 46  0.9989899 0.8121302     74.2    15.2  22281      13      13.5    18.0  TRUE
## 47  0.9081197 0.7654110     77.3    14.8  19409      13      12.7    25.8  TRUE
## 48  0.9875666 0.5246691     74.4    14.7  83961      14      14.5     1.5  TRUE
## 49  0.8891235 0.7504363     76.2    15.2  14558       7      15.2    17.3  TRUE
## 50  0.9436009 0.7939778     71.3    15.7  16676       1      20.6    30.1  TRUE
## 51  0.9686486 0.7963738     70.1    14.7  22352      24      25.7    14.5  TRUE
## 52  0.8266200 0.3510896     76.8    13.6  34858      11      10.6     9.6  TRUE
## 53  0.9358696 0.7503852     74.7    14.2  18108      33      31.0    12.0  TRUE
## 54  1.0815109 0.7239583     77.2    15.5  19283      14      58.3    11.5  TRUE
## 55  1.0410959 0.8738966     75.4    12.6  21336      37      28.5    16.7  TRUE
## 56  0.9645749 0.8690629     69.4    15.0  20867      26      29.9    20.1  TRUE
## 57  1.0205245 0.8603133     75.6    15.4  12488      52      48.4    19.6  TRUE
## 58         NA        NA     76.1    14.0  20070      NA      49.3    25.7 FALSE
## 59  0.9717868 0.8118644     74.2    14.4  15596       5      35.9    20.4  TRUE
## 60         NA        NA     72.7    13.7  13496      NA        NA    10.3 FALSE
## 61  1.0821643 0.5990220     77.6    13.3  18192      85      78.5    19.3  TRUE
## 62  0.9130435 0.5880795     74.7    12.7  22762      29       5.7    14.2  TRUE
## 63  0.8517241 0.5876011     74.4    15.6  17470      73      30.9    11.6  TRUE
## 64  1.0045045        NA     73.1    13.4  23300      NA      56.3    43.8 FALSE
## 65  0.9802956 0.7019868     70.4    12.3  26090      84      34.8    24.7  TRUE
## 66  0.7934783 0.7307061     74.9    14.4  12190      16      16.9    34.0  TRUE
## 67  0.9428934 0.6200000     79.4    13.8   7301      80      43.1    48.9  TRUE
## 68  0.9566787 0.3286319     79.3    13.8  16509      16      12.0     3.1  TRUE
## 69  1.0039604 0.5898734     79.4    13.9  13413      38      60.8    33.3  TRUE
## 70  0.9201183 0.2255435     75.4    15.1  15440      23      31.6     3.1  TRUE
## 71  1.1141732 0.6452020     74.2    14.2  16159     110      83.2    17.0  TRUE
## 72  0.6500000 0.4152542     75.3    14.5  18677      20      30.9    14.4  TRUE
## 73  0.9515707 0.4600262     74.9    13.7   9779      29      16.9     5.8  TRUE
## 74  0.9191419 0.5644556     76.8    13.1  16056      49      63.4    37.1  TRUE
## 75  1.0419847 0.7351485     74.5    15.2  15175      69      70.8     9.6  TRUE
## 76  0.9676375 0.7523302     74.9    13.8   7164      41      46.8    11.3  TRUE
## 77         NA        NA     73.8    12.9  20805      NA        NA     6.7 FALSE
## 78  0.9620123 0.9037356     70.8    11.9  16428      26      40.0    15.6  TRUE
## 79         NA        NA     73.4    15.8  10939      23      35.4    25.0 FALSE
## 80  0.8853503 0.2342342     74.0    13.5  11365      50      26.5    11.6  TRUE
## 81  0.7230216 0.6385185     75.4    13.4  11780       7      18.3    33.3  TRUE
## 82  0.9562044 0.7952167     71.0    15.1   8178      23      25.7    11.8  TRUE
## 83  0.8612903 0.2105263     74.8    14.0  13054      89      10.0    25.7  TRUE
## 84  0.8517398 0.8080569     74.6    13.1  11015      89      50.7    22.3  TRUE
## 85  0.9306030 0.6854962     77.8    11.8   9943      21      15.3    20.7  TRUE
## 86  0.9894737 0.7465565     74.7    12.3   8124      29      27.1    10.7  TRUE
## 87  0.6432665 0.5951134     76.5    13.6   9638       8      15.1    19.3  TRUE
## 88  1.0177665 0.6614268     75.9    14.2  10605      87      77.0    41.6  TRUE
## 89         NA 0.8228346     75.1    12.6   9765      34      56.3    20.7 FALSE
## 90  0.8164117 0.8160920     75.8    13.1  12547      32       8.6    23.6  TRUE
## 91  0.9953488 0.5208333     70.0    15.7   7493      59      42.8    14.0  TRUE
## 92  1.0142687 0.8167388     69.4    14.6  10729      68      18.7    14.9  TRUE
## 93  0.8750000 0.7967782     74.4    13.5  13323      26      41.0     6.1  TRUE
## 94  1.2801724        NA     77.8    12.7   9994      NA        NA    21.9 FALSE
## 95  1.3245823 0.3926702     71.6    14.0  14911      15       2.5    16.0  TRUE
## 96  0.7114967 0.3540197     74.8    14.6  10404      46       4.6    31.3  TRUE
## 97  1.0233813 0.7001255     74.0    13.5  12040      83      68.5    20.9  TRUE
## 98         NA 0.7141026     72.9    13.4   9937      45      54.5    13.0 FALSE
## 99  1.0541311 0.7912553     75.7    12.4   7415      80      70.1    16.7  TRUE
## 100 0.9909400 0.7171582     72.8    14.7   5069     120      18.1     0.0  TRUE
## 101 1.0079156 0.5978129     70.0    13.6   7614      45      71.4    13.3  TRUE
## 102 1.0470810 0.6526718     73.5    13.1  11883     100      99.6    19.1  TRUE
## 103 0.9469214 0.5886628     71.1    12.7  15617     130      35.2    11.8  TRUE
## 104 0.8348624 0.7251613     76.8    13.0  12328      31       4.2     5.9  TRUE
## 105 1.0716667 0.4023973     73.4    12.9   5327      58      28.3     6.1  TRUE
## 106 0.9448010 0.8811275     64.5    12.5  16646     170      44.2     9.5  TRUE
## 107 0.9689441 0.8506787     71.6    11.9   5223      21      29.3    20.8  TRUE
## 108 0.7244224 0.3168449     71.1    13.5  10512      45      43.0     2.2  TRUE
## 109        NA 0.6098830     65.6    10.8  13066      61      18.0    25.8 FALSE
## 110 1.4930748 0.8593272     64.4    12.5  16367     240     103.0    16.2  TRUE
## 111 0.8109756 0.6104513     68.9    13.0   9788     190      48.3    17.1  TRUE
## 112 0.8558140 0.6568396     72.9    11.9   7643     110      67.0    16.8  TRUE
## 113 0.9074074 0.2319277     72.9    13.0   4699      NA      45.8      NA FALSE
## 114        NA 0.6362434     68.4    11.5   5567      36      38.8    16.4 FALSE
## 115 1.0345369 0.6411543     68.2    11.3   7915     120      46.8    27.1  TRUE
## 116 0.8440367 0.6050633     73.0    12.3   7349      69      76.0    27.4  TRUE
## 117 0.9578393 0.7355372     57.4    13.6  12122     140      50.9    40.7  TRUE
## 118 0.8342697 0.8880779     75.8    11.9   5092      49      29.0    24.3  TRUE
## 119 0.8054146 0.7935723     68.3    13.2   5760     200      71.9    51.8  TRUE
## 120 0.9762397 0.7044025     70.6    12.5   3044      75      29.3    23.3  TRUE
## 121 0.5537849 0.2134670     69.4    10.1  14003      67      68.7    26.5  TRUE
## 122        NA 0.6152927     73.3    13.5   6094      53      70.6    20.8 FALSE
## 123        NA        NA     69.1    11.7   3432      96      18.6     0.0 FALSE
## 124 1.2615063 0.5291925     66.4    10.3   6522     250      88.5    31.3  TRUE
## 125 1.0287206 0.5902864     74.9    11.5   4457     100     100.8    39.1  TRUE
## 126 0.6854305 0.3496042     74.0    11.6   6850     120      35.8    11.0  TRUE
## 127 0.9680233 0.8587127     64.8    11.3   9418     130      54.9    37.7  TRUE
## 128 0.9439655 0.5589569     71.8    10.7   6929     140      97.2    13.3  TRUE
## 129 1.0427632 0.7639429     69.4    11.2   2517      44      42.8    15.2  TRUE
## 130 0.4770318 0.3379224     68.0    11.7   5497     190      32.8    12.2  TRUE
## 131 1.0852713 0.5162847     73.1    11.1   3938     120      84.0    25.8  TRUE
## 132 0.9855072 0.8639896     69.5    12.6   7176     120      40.9     8.3  TRUE
## 133        NA 0.4842520     68.2    11.7   5363     270      52.2    38.5 FALSE
## 134 0.7283951 0.1856946     69.6    12.3   2728      49      41.6    12.4  TRUE
## 135        NA 0.7687500     71.9    10.6   2803      86      44.8     0.0 FALSE
## 136 0.8446809 0.9383562     62.3    11.1   6012     410     126.7    11.5  TRUE
## 137        NA        NA     66.0    12.3   2434     130      16.6     8.7 FALSE
## 138        NA 0.8752711     57.6     9.0  21056     290     112.6    19.7 FALSE
## 139 0.5863636 0.8539720     60.1    13.5   3734     280     125.4    12.7  TRUE
## 140 0.6986090 0.9425770     61.4    11.5   3852     380      58.4    10.9  TRUE
## 141 0.6189189 0.9646018     66.2    10.6   4680      NA      65.0    25.0 FALSE
## 142 0.8256659 0.6825208     71.6    10.0   3191     170      80.6    20.0  TRUE
## 143 0.4323144 0.9109827     68.4    10.9   2949     170      44.3    19.0  TRUE
## 144        NA 0.5822622     66.5    11.3   2918     210      65.1    18.2 FALSE
## 145 0.8057325 0.8591160     61.6    11.0   2762     400      93.6    20.8  TRUE
## 146 0.4633508 0.9173364     69.6    12.4   2311     190      73.7    29.5  TRUE
## 147 0.4186551 0.2967431     66.2     7.8   4866     170      27.3    19.7  TRUE
## 148 1.4967320 0.9137303     65.9     8.6   4608     200      12.1     4.7  TRUE
## 149        NA 0.8231469     52.3    11.4   6822     460     170.2    36.8 FALSE
## 150 0.8423077 0.6131285     49.0    11.3   5542     310      72.0    14.7  TRUE
## 151 0.5894737 0.9767184     65.0     9.2   2411     410     122.7    36.0  TRUE
## 152        NA 0.7566719     52.8     9.0   5341     560     119.6     6.6 FALSE
## 153 0.6103152 0.8307292     55.5    10.4   2803     590     115.8    27.1  TRUE
## 154        NA 0.9569061     65.1    10.3   1328     440     122.8    20.5 FALSE
## 155 0.7854839 0.9275362     57.5    10.9   1615     470      60.3    35.1  TRUE
## 156 0.3971292 0.3628319     63.1     8.5   3560     320      73.3    22.2  TRUE
## 157        NA 0.6759494     67.9     9.2   1540     130      64.9     2.0 FALSE
## 158 0.5241379 0.9527027     62.6     9.9   2463     220      62.1     2.7  TRUE
## 159        NA 0.4394507     63.3    11.5   1456     350      51.1     3.0 FALSE
## 160 0.3220974 0.3518006     63.8     9.2   3519     270      47.0     0.7  TRUE
## 161 1.1526316 0.8027211     49.8    11.1   3306     490      89.4    26.8  TRUE
## 162 0.3995037 0.9913899     59.7    12.2   1228     450      91.5    17.6  TRUE
## 163 0.6363636 0.8577465     62.8     8.7   1669     380      42.0     3.5  TRUE
## 164 0.9090909 1.0128957     64.2    10.3   1458     320      33.6    57.5  TRUE
## 165 0.6835821 0.9570707     58.5     9.8   1613     360     126.6    35.0  TRUE
## 166 0.4185185 0.8633461     59.6    11.1   1767     340      90.2     8.4  TRUE
## 167 0.6648352 0.4118421     63.5     7.0   3809     360      84.0    23.8  TRUE
## 168        NA 0.5361891     62.0     6.4   3276     230      18.6    12.7 FALSE
## 169        NA        NA     55.7     7.6   2332     730      75.3    24.3 FALSE
## 170 0.4675325 0.7500000     66.5     7.9   2188     320      94.4    42.7  TRUE
## 171 0.1979866 0.1987421     60.4     9.3   1885     400      86.8    27.6  TRUE
## 172 0.4651163 0.6437346     51.5     8.9   3171     720     130.3     9.2  TRUE
## 173 0.5138889 1.0380368     62.8    10.8    747     510     144.8    16.7  TRUE
## 174 0.4285714 0.8756999     64.1     8.5   1428     420      78.4    25.5  TRUE
## 175 0.5523810 0.8709288     60.2     8.8   1507     430     115.8     9.4  TRUE
## 176 0.3950617 0.9658470     58.7     9.8    680     730     135.3     8.2  TRUE
## 177 0.3918575 0.8981481     60.9     9.5    805     640     117.4    10.7  TRUE
## 178        NA 0.8687898     55.2     9.0   1362     560      99.3    13.7 FALSE
## 179 0.5099338 0.6240786     58.0     8.4   1583     550     175.6     9.5  TRUE
## 180 0.2258065 1.0326087     55.1     9.3   1123     480     137.8    39.6  TRUE
## 181 0.4608295 0.9521739     50.9     8.6   1780    1100     100.7    12.4  TRUE
## 182        NA 0.8378033     58.8     8.7   1096     650     131.0    21.9 FALSE
## 183 0.2812500 0.8566667     58.7     7.8   1591     400     115.4    13.3  TRUE
## 184 0.6385542 1.0158537     56.7    10.1    758     740      30.3    34.9  TRUE
## 185 0.1717172 0.8080808     51.6     7.4   2085     980     152.0    14.9  TRUE
## 186        NA 0.8908686     63.7     4.1   1130     380      65.3    22.0 FALSE
## 187 0.3782772 0.8531140     50.7     7.2    581     880      98.3    12.5  TRUE
## 188 0.3076923 0.4459309     61.4     5.4    908     630     204.8    13.3  TRUE
## 189 0.7289916 0.3081009     70.6    12.0  15722     155      45.4    14.0  TRUE
## 190 0.8250377 0.7884131     74.0    12.7  11449      72      21.2    18.7  TRUE
## 191 0.8784119 0.6514286     72.3    13.6  12791      28      30.8    19.0  TRUE
## 192 0.9836957 0.6729323     75.0    14.0  14242      85      68.3    27.0  TRUE
## 193 0.5329670 0.3711083     68.4    11.2   5605     183      38.7    17.5  TRUE
## 194 0.7015873 0.8537859     58.5     9.6   3363     506     109.7    22.5  TRUE
## 195 0.8333333 0.6558018     71.5    12.2  14301     210      47.4    21.8  TRUE
# filter out all rows with NA values
human_filterd <- filter(human, complete.cases(human))
dim(human_filterd)
## [1] 162   9
dim(human)
## [1] 195   9

So you can see that the original human dataset had 195 rows; after filtering, it has 162.

Extra coding 2: alternatively, you could achieve the same with the following code

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## ✓ tibble  3.0.4     ✓ purrr   0.3.4
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x dplyr::lag()     masks stats::lag()
## x purrr::lift()    masks caret::lift()
## x plotly::select() masks MASS::select(), dplyr::select()
human_filterd_new = human %>% drop_na()
dim(human_filterd_new)
## [1] 162   9
# Or, equivalently:
human_ = human %>% drop_na(colnames(human))
dim(human_)
## [1] 162   9

Task 4: Remove the observations which relate to regions instead of countries

#look at the last 10 observations
tail(human_filterd, 10)
##                             Country   Edu2_FM   Labo_FM Life_Exp Edu_Exp   GNI
## 153                            Chad 0.1717172 0.8080808     51.6     7.4  2085
## 154        Central African Republic 0.3782772 0.8531140     50.7     7.2   581
## 155                           Niger 0.3076923 0.4459309     61.4     5.4   908
## 156                     Arab States 0.7289916 0.3081009     70.6    12.0 15722
## 157       East Asia and the Pacific 0.8250377 0.7884131     74.0    12.7 11449
## 158         Europe and Central Asia 0.8784119 0.6514286     72.3    13.6 12791
## 159 Latin America and the Caribbean 0.9836957 0.6729323     75.0    14.0 14242
## 160                      South Asia 0.5329670 0.3711083     68.4    11.2  5605
## 161              Sub-Saharan Africa 0.7015873 0.8537859     58.5     9.6  3363
## 162                           World 0.8333333 0.6558018     71.5    12.2 14301
##     Mat_Mor Ado_Birth Parli_F
## 153     980     152.0    14.9
## 154     880      98.3    12.5
## 155     630     204.8    13.3
## 156     155      45.4    14.0
## 157      72      21.2    18.7
## 158      28      30.8    19.0
## 159      85      68.3    27.0
## 160     183      38.7    17.5
## 161     506     109.7    22.5
## 162     210      47.4    21.8
# last index we want to keep
last <- nrow(human_filterd) - 7

# choose everything until the last 7 observations
human_ <- human_filterd[1:last, ]
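Positional slicing is fragile if the row order ever changes; a name-based alternative drops the aggregate rows explicitly. The region labels below are copied from the tail() output above; a toy frame is used for illustration:

```r
regions <- c("Arab States", "East Asia and the Pacific",
             "Europe and Central Asia", "Latin America and the Caribbean",
             "South Asia", "Sub-Saharan Africa", "World")
df <- data.frame(Country = c("Norway", "World", "Chad", "South Asia"))
df[!df$Country %in% regions, , drop = FALSE]  # keeps Norway and Chad
# in the actual data: human_ <- human_filterd[!human_filterd$Country %in% regions, ]
```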

Task 5. Define the row names of the data by the country names

# add countries as rownames
rownames(human_) <- human_$Country

# remove the Country variable
human_ <- select(human_, -Country)

# load library
library(GGally)
library(corrplot)
# visualize the 'human_' variables
ggpairs(human_)

As the graph shows, there are strong correlations between some variables, such as Ado_Birth with Mat_Mor and Edu_Exp; similarly, GNI with Life_Exp and Edu_Exp.

# compute the correlation matrix and visualize it with corrplot
cor(human_) %>% corrplot(type = "lower")

As the graph shows, there is a strong negative correlation between Mat_Mor and Life_Exp, which means that as Mat_Mor increases, Life_Exp decreases. The same holds for Life_Exp and Ado_Birth.
There is also a positive correlation between Ado_Birth and Mat_Mor. Parli_F and Labo_FM correlate only weakly with the other variables.
TIP: by including `type = "lower"` you plot only the lower triangular part of the correlation matrix.
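A few extra `corrplot` arguments (all from the corrplot documentation) can make the matrix easier to read; a sketch:

```r
# same correlation matrix, but with the coefficients printed in each cell
cor(human_) %>%
  corrplot(type = "lower",        # lower triangle only
           method = "color",      # filled squares instead of circles
           addCoef.col = "black", # print the correlation coefficients
           tl.cex = 0.8)          # smaller variable labels
```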

write.csv(human_, "human_new.csv")

Part 2) Data analysis

human <- read.csv("human_new.csv")
# Alternatively, you can read it directly from the web
human <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep = ",", stringsAsFactors = F)
dim(human)
## [1] 155   8

Task 1 : Show a graphical overview

# standardize the variables
human_std <- scale(human)

With `center = TRUE` (the default), `scale()` centers each column by subtracting the column mean (omitting NAs) and then divides it by the column standard deviation.
We standardize the data because PCA is sensitive to the relative scaling of the original features.
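What `scale()` does can be reproduced by hand with `sweep()`, which makes the centering and rescaling explicit (a sketch):

```r
# subtract the column means, then divide by the column standard deviations
manual_std <- sweep(human, 2, colMeans(human), FUN = "-")
manual_std <- sweep(manual_std, 2, apply(human, 2, sd), FUN = "/")

# agrees with scale(human) up to attributes
all.equal(as.matrix(manual_std), scale(human), check.attributes = FALSE)
```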

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

As you can see from the summaries of the scaled and unscaled data, each standardized variable now has mean zero (and unit standard deviation), so the variables are on comparable scales. Now we are ready for further analysis.
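This can be verified directly: after `scale()` every column should have mean zero and standard deviation one (a quick check):

```r
# numerically zero means and unit standard deviations after scaling
round(colMeans(human_std), 10)
apply(human_std, 2, sd)
```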

ggpairs(as.data.frame(human_std))

This visualises the distribution of each column and its pairwise relationships with the other columns.

cor(as.data.frame(human_std)) %>% corrplot(type = "lower")

The correlation matrix (same as above) is shown here; the interpretation is the same as in the data wrangling part.

Task 2 Perform principal component analysis (PCA) with the SVD method
As instructed, we do PCA on the non-standardized human data.

pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

The arrows show the original variables (they overlap each other).
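The overlap has a simple explanation: without standardization, the first principal component is dominated by whichever variable has the largest raw variance. A quick check (a sketch):

```r
# GNI's raw variance is orders of magnitude larger than the others,
# which is why unstandardized PC1 captures essentially all of the variance
sort(apply(human, 2, var), decreasing = TRUE)
```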

Task 3 Perform PCA with standardized variables

pca_human <- prcomp(human_std)
summary(pca_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

The arrows show the original variables

# rounded percentages of variance captured by each PC
pca_pr <- round(100*summary(pca_human)$importance[2,], digits = 1) 
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Interpretation: in the biplot, a small angle between two arrows means a high correlation between the corresponding variables. Parli_F forms a wide angle with Edu.Exp and Life_Exp, so these pairs are not highly correlated (as also shown in the correlation graphs above).
On the other hand, Edu.Exp, GNI and Edu2.FM have very narrow angles between them, and thus very high correlations (again, as explained by the correlation plots above).
Additionally, we can see that principal component 2 (PC2), on the y-axis, points in the same direction as the Parli_F and Labo_FM variables, so those features contribute mainly to PC2, while all the other variables contribute mainly to PC1. More info at https://vimeo.com/204165420

The PCA components printed by the summary function differ between the two runs, which is as expected, because we standardized the data to remove the scale differences. Since the data were standardized, the standard deviation and proportion of variance of each component differ from the non-standardized results.
We should remember that PCA is sensitive to the scaling of the variables.
To interpret a PCA, we can look at the proportion of variance and its cumulative proportion. For example, on the standardized data, PC1 explains 53.6% of the variance and PC2 16.2%. In the non-standardized run, PC1 alone explained over 99% of the variance, which looks doubtful to me; it simply reflects the remaining scale differences in the data, to which PCA is sensitive. Thus, standardizing the data before PCA is very effective and helpful for meaningful dimension reduction!
The standardized result also makes sense substantively: in the real world, Gross National Income per capita (GNI) is higher where expected years of schooling (Edu_Exp) are higher, and life expectancy (Life_Exp) is then higher too.
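For standardized data, PCA is equivalent to an eigendecomposition of the correlation matrix of the raw variables, so the variance proportions above can be cross-checked (a sketch):

```r
# eigenvalues of the correlation matrix are the variances of the standardized PCs
ev <- eigen(cor(human))$values
round(100 * ev / sum(ev), 1)  # should match the Proportion of Variance row
```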

Task 4: I personally find PCA very useful, also in my own research in forest data science, where I often have tens of variables. Here PC1 and PC2 together explain nearly 70% (53.6 + 16.2) of the variance in the original data.

Extra coding 3: Plot PCA in 3D

inspired from https://cran.r-project.org/web/packages/pca3d/vignettes/pca3d.pdf

library(pca3d)
gr<-factor(rownames(human_std))
#summary(gr)
pca3d(pca_human, group=gr)
## [1] 0.12143416 0.06251040 0.05134851
## Creating new device
snapshotPCA3d(file="first_plot.png")

Feel free to zoom in on it or rotate it.
Every point is an observation (country) in the 3D space created by PC1, PC2 and PC3.

Multiple Correspondence Analysis

Task 5:
Let’s do Multiple Correspondence Analysis (MCA) using the FactoMineR package, which contains functions dedicated to multivariate exploratory data analysis, for example (Multiple) Correspondence Analysis, Multiple Factor Analysis and PCA.
We will do only Multiple Correspondence Analysis (MCA) with this data.
In this part I am going to use the tea dataset, which contains the answers to a questionnaire on tea consumption.

library(FactoMineR)
library(ggplot2)
data(tea)
View(tea)

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...

As you can see, the data include information on tea consumption, e.g. 74 drinkers of black tea, 193 of Earl Grey and 33 of green tea, how people drink their tea, and so on. We are going to plot the variables and inspect them visually (as follows).

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

So you can see the data include information on tea consumption: multiple columns about how the tea is drunk, what type of tea, and where and when it is drunk.
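The warning from `gather()` above appears because the factor columns have different level sets. The newer `tidyr::pivot_longer()` makes the coercion explicit (a sketch, assuming tidyr >= 1.1):

```r
library(tidyr)
library(ggplot2)

# stack all columns as character, then draw one bar panel per variable
tea_time %>%
  pivot_longer(everything(), names_to = "key", values_to = "value",
               values_transform = list(value = as.character)) %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap("key", scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```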

We use some of the columns for MCA analysis

mca <- MCA(tea_time, graph = FALSE)

#summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

Interpretation: MCA is another approach to dimensionality reduction. MCA works well with categorical variables, as PCA does with numerical variables. As the summary shows, MCA gives us, for example, the eigenvalues, which include the variance, the percentage of variance and the cumulative percentage of variance for each dimension. Dimension 1 explains 15.2% of the variance in the data, dimension 2 about 14% and dimension 3 nearly 12%. The percentage declines gradually, as I plot in the extra coding parts below.
The next part of the summary shows the individuals (the first 10 here) for each dimension: their coordinates, their contribution to each dimension (ctr), and the squared cosine (cos2). For example, individual 1 has coordinate -0.298 on dimension 1, a contribution of 0.106 and a cos2 of 0.086. Note that a dimension here corresponds to a principal component (PC) in PCA; in both approaches (MCA and PCA) the original data are transformed.
The next part, Categories (the 10 first), adds v.test values (like a normal t-test) to the same quantities explained above. The v.test value for black tea is 4.677, which means its coordinate differs significantly from zero; any value above +1.96 or below -1.96 indicates that.
The final part of the summary shows the categorical variables (eta2), i.e. the squared correlation between each variable and each dimension. The closer to 1, the stronger the relationship. For example, the how variable and Dim.1 are strongly related (0.708), while lunch and Dim.1 show no relationship (0.000). For more info please check https://vimeo.com/204251158
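The quantities in the summary can also be pulled directly out of the `mca` object, which is handy for further processing (a sketch based on the FactoMineR return value):

```r
# eigenvalue, % of variance and cumulative % of variance per dimension
head(mca$eig, 3)

# squared correlation (eta2) of each variable with the first three dimensions
round(mca$var$eta2[, 1:3], 3)

# categories whose coordinates differ significantly from zero (|v.test| > 1.96)
abs(mca$var$v.test[, 1:3]) > 1.96
```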

Visualize MCA by biplot

plot(mca, invisible=c("ind"), habillage = "quali") 

Interpretation: the plot shows the MCA factor map, made by plotting Dim 1 and Dim 2 on the x and y axes, respectively. It makes it easy to see possible patterns among the variables: the shorter the distance between two categories, the more similar they are, e.g. No.sugar and Not.lunch in the middle of the graph, which makes good sense :). But tea bag and unpackaged are far from each other, which shows the dissimilarity between them.

Extra coding 4: other plotting options for multiple correspondence analysis

Inspired by: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/

library("factoextra")
# Color by cos2 values: quality on the factor map
fviz_mca_var(mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # Avoid text overlapping
             ggtheme = theme_minimal())

The graph shows the squared correlation (cos2) of the variables: it is the same factor map as above, but now colored by cos2, which is the new thing here.

Or yet another way of plotting

factoextra::fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))


Interpretation: we can see that as the dimension number increases, the percentage of explained variance declines. The first dimensions capture most of the variance in the data, so the later ones have little left to explain; for example, the 10th dimension contributes little because most of the variance is already accounted for by the earlier dimensions.
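The same saturation can be read off numerically from the cumulative percentages (a sketch; the third column of `mca$eig` holds the cumulative % of variance):

```r
# cumulative percentage of variance explained per dimension
round(mca$eig[, 3], 1)

# number of dimensions needed to reach at least 80% of the variance
which(mca$eig[, 3] >= 80)[1]
```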


(more chapters to be added similarly as we proceed with the course!)